2025-05-07T19:40:44.6473666Z Current runner version: '2.323.0' 2025-05-07T19:40:44.6479476Z Runner name: 'i-0ca7df191fe9703a4' 2025-05-07T19:40:44.6480487Z Machine name: 'ip-10-0-5-85' 2025-05-07T19:40:44.6483095Z ##[group]GITHUB_TOKEN Permissions 2025-05-07T19:40:44.6485669Z Contents: read 2025-05-07T19:40:44.6486200Z Metadata: read 2025-05-07T19:40:44.6486700Z ##[endgroup] 2025-05-07T19:40:44.6489029Z Secret source: None 2025-05-07T19:40:44.6490078Z Prepare workflow directory 2025-05-07T19:40:44.7180381Z Prepare all required actions 2025-05-07T19:40:44.7227346Z Getting action download info 2025-05-07T19:40:44.8939846Z Download action repository 'actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683) 2025-05-07T19:40:45.1553996Z Download action repository 'actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02) 2025-05-07T19:40:45.6869648Z Complete job name: pytorch/FBGEMM / build-manywheel-py3_9-cuda12_8 2025-05-07T19:40:45.7584656Z A job started hook has been configured by the self-hosted runner administrator 2025-05-07T19:40:45.7760272Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh' 2025-05-07T19:40:45.7773840Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T19:40:45.7775536Z ##[endgroup] 2025-05-07T19:40:47.6193009Z Runner Type: linux.g5.4xlarge.nvidia.gpu 2025-05-07T19:40:47.6193480Z Instance Type: g5.4xlarge 2025-05-07T19:40:47.6193718Z AMI Name: unknown 2025-05-07T19:40:47.6232327Z AMI ID: ami-071226ecf16aa7d96 2025-05-07T19:40:53.0033443Z ##[group]Checking docker version 2025-05-07T19:40:53.0045337Z ##[command]/usr/bin/docker version --format '{{.Server.APIVersion}}' 2025-05-07T19:40:53.0248582Z '1.44' 2025-05-07T19:40:53.0278525Z Docker daemon API version: '1.44' 2025-05-07T19:40:53.0279011Z ##[command]/usr/bin/docker version --format '{{.Client.APIVersion}}' 2025-05-07T19:40:53.0470096Z '1.44' 2025-05-07T19:40:53.0490686Z Docker client API version: '1.44' 2025-05-07T19:40:53.0495380Z ##[endgroup] 2025-05-07T19:40:53.0497801Z ##[group]Clean up resources from previous jobs 2025-05-07T19:40:53.0502936Z ##[command]/usr/bin/docker ps --all --quiet --no-trunc --filter "label=1ad978" 2025-05-07T19:40:53.0647040Z ##[command]/usr/bin/docker network prune --force --filter "label=1ad978" 2025-05-07T19:40:53.0802998Z ##[endgroup] 2025-05-07T19:40:53.0803246Z ##[group]Create local container network 2025-05-07T19:40:53.0811517Z ##[command]/usr/bin/docker network create --label 1ad978 github_network_647c08a59ff240119465636525774298 2025-05-07T19:40:53.3513876Z 24f678d7e6889a1e004ba7f463e2596ff4f606a6f60051d23c9aab6318cf346a 2025-05-07T19:40:53.3541296Z ##[endgroup] 2025-05-07T19:40:53.3563573Z ##[group]Starting job container 2025-05-07T19:40:53.3581999Z ##[command]/usr/bin/docker pull pytorch/manylinux2_28-builder:cuda12.8 2025-05-07T19:40:53.5525004Z cuda12.8: Pulling from pytorch/manylinux2_28-builder 2025-05-07T19:40:53.5525593Z 09720f817e0c: Pulling fs layer 2025-05-07T19:40:53.5525994Z 2515086f3df4: Pulling fs layer 2025-05-07T19:40:53.5526396Z 40bd9142b82e: Pulling fs layer 2025-05-07T19:40:53.5526739Z 04ff279811b1: Pulling fs layer 2025-05-07T19:40:53.5527071Z 1ecc152f7174: Pulling fs layer 2025-05-07T19:40:53.5527408Z a5e9d3452d94: Pulling fs layer 2025-05-07T19:40:53.5527765Z fd79b2bfc106: Pulling fs layer 2025-05-07T19:40:53.5528118Z dfc73d088441: Pulling fs layer 2025-05-07T19:40:53.5528449Z 049535df493b: Pulling fs layer 2025-05-07T19:40:53.5528794Z dbe1603ccc47: Pulling fs layer 2025-05-07T19:40:53.5529176Z aeaa64568845: Pulling fs layer 2025-05-07T19:40:53.5529514Z 771eeddba98c: Pulling fs layer 2025-05-07T19:40:53.5529835Z 926a41625fca: Pulling fs layer 2025-05-07T19:40:53.5530077Z 6b8cb4269f2e: Pulling fs layer 2025-05-07T19:40:53.5530330Z 5122e5ed1720: Pulling fs layer 2025-05-07T19:40:53.5530571Z 797f43573cbe: Pulling fs layer 2025-05-07T19:40:53.5530822Z 075eacd4adce: Pulling fs layer 2025-05-07T19:40:53.5531159Z 5924a796f495: Pulling fs layer 2025-05-07T19:40:53.5531443Z f6ec9d98ad10: Pulling fs layer 2025-05-07T19:40:53.5532061Z ff887ee56f5e: Pulling fs layer 2025-05-07T19:40:53.5532316Z 2b47deca1220: Pulling fs layer 2025-05-07T19:40:53.5532577Z 48558664882c: Pulling fs layer 2025-05-07T19:40:53.5532820Z 39220f950313: Pulling fs layer 2025-05-07T19:40:53.5533068Z 2c16fc959be4: Pulling fs layer 2025-05-07T19:40:53.5533328Z 4f4fb700ef54: Pulling fs layer 2025-05-07T19:40:53.5533584Z dbb77404d40f: Pulling fs layer 2025-05-07T19:40:53.5533827Z 35915bf195e7: Pulling fs layer 2025-05-07T19:40:53.5534079Z 3d17c495d8c9: Pulling fs layer 2025-05-07T19:40:53.5534315Z 1ecc152f7174: Waiting 2025-05-07T19:40:53.5534527Z a5e9d3452d94: Waiting 2025-05-07T19:40:53.5534729Z fd79b2bfc106: Waiting 2025-05-07T19:40:53.5534938Z dfc73d088441: Waiting 2025-05-07T19:40:53.5535215Z 049535df493b: Waiting 2025-05-07T19:40:53.5535416Z dbe1603ccc47: Waiting 2025-05-07T19:40:53.5535624Z 04ff279811b1: Waiting 2025-05-07T19:40:53.5535822Z ff887ee56f5e: Waiting 2025-05-07T19:40:53.5536032Z aeaa64568845: Waiting 2025-05-07T19:40:53.5536244Z 771eeddba98c: Waiting 2025-05-07T19:40:53.5536455Z 2c16fc959be4: Waiting 2025-05-07T19:40:53.5536659Z dbb77404d40f: Waiting 2025-05-07T19:40:53.5536863Z 926a41625fca: Waiting 2025-05-07T19:40:53.5537072Z 4f4fb700ef54: Waiting 2025-05-07T19:40:53.5537270Z 2b47deca1220: Waiting 2025-05-07T19:40:53.5537474Z 35915bf195e7: Waiting 2025-05-07T19:40:53.5537671Z 6b8cb4269f2e: Waiting 2025-05-07T19:40:53.5537876Z 3d17c495d8c9: Waiting 2025-05-07T19:40:53.5538069Z 39220f950313: Waiting 2025-05-07T19:40:53.5538433Z 5122e5ed1720: Waiting 2025-05-07T19:40:53.5538634Z 797f43573cbe: Waiting 2025-05-07T19:40:53.5538841Z 075eacd4adce: Waiting 2025-05-07T19:40:53.5539045Z 5924a796f495: Waiting 2025-05-07T19:40:53.5539251Z f6ec9d98ad10: Waiting 2025-05-07T19:40:54.2373723Z 40bd9142b82e: Verifying Checksum 2025-05-07T19:40:54.2374190Z 40bd9142b82e: Download complete 2025-05-07T19:40:54.2891564Z 09720f817e0c: Verifying Checksum 2025-05-07T19:40:54.2892010Z 09720f817e0c: Download complete 2025-05-07T19:40:54.3106074Z 2515086f3df4: Verifying Checksum 2025-05-07T19:40:54.3106650Z 2515086f3df4: Download complete 2025-05-07T19:40:54.5339936Z 1ecc152f7174: Verifying Checksum 2025-05-07T19:40:54.5340253Z 1ecc152f7174: Download complete 2025-05-07T19:40:54.5822028Z fd79b2bfc106: Verifying Checksum 2025-05-07T19:40:54.5822558Z fd79b2bfc106: Download complete 2025-05-07T19:40:54.7760275Z dfc73d088441: Verifying Checksum 2025-05-07T19:40:54.7760809Z dfc73d088441: Download complete 2025-05-07T19:40:54.8220289Z 049535df493b: Verifying Checksum 2025-05-07T19:40:54.8220782Z 049535df493b: Download complete 2025-05-07T19:40:55.0225003Z a5e9d3452d94: Verifying Checksum 2025-05-07T19:40:55.0225707Z a5e9d3452d94: Download complete 2025-05-07T19:40:55.1277573Z aeaa64568845: Download complete 2025-05-07T19:40:56.2149214Z 04ff279811b1: Verifying Checksum 2025-05-07T19:40:56.2149651Z 04ff279811b1: Download complete 2025-05-07T19:40:56.3503697Z 926a41625fca: Download complete 2025-05-07T19:40:56.4389652Z 6b8cb4269f2e: Download complete 2025-05-07T19:40:56.4932391Z 5122e5ed1720: Verifying Checksum 2025-05-07T19:40:56.4932873Z 5122e5ed1720: Download complete 2025-05-07T19:40:56.5468189Z 797f43573cbe: Verifying Checksum 2025-05-07T19:40:56.5468823Z 797f43573cbe: Download complete 2025-05-07T19:40:56.6315481Z 075eacd4adce: Verifying Checksum 2025-05-07T19:40:56.6315896Z 075eacd4adce: Download complete 2025-05-07T19:40:56.8772048Z 5924a796f495: Verifying Checksum 2025-05-07T19:40:56.8772356Z 5924a796f495: Download complete 2025-05-07T19:40:56.9074450Z dbe1603ccc47: Verifying Checksum 2025-05-07T19:40:56.9074975Z dbe1603ccc47: Download complete 2025-05-07T19:40:56.9578615Z f6ec9d98ad10: Download complete 2025-05-07T19:40:56.9720137Z 09720f817e0c: Pull complete 2025-05-07T19:40:56.9829669Z ff887ee56f5e: Verifying Checksum 2025-05-07T19:40:56.9830264Z ff887ee56f5e: Download complete 2025-05-07T19:40:57.0249884Z 2b47deca1220: Verifying Checksum 2025-05-07T19:40:57.2873571Z 2b47deca1220: Download complete 2025-05-07T19:40:57.2873894Z 39220f950313: Verifying Checksum 2025-05-07T19:40:57.2874516Z 39220f950313: Download complete 2025-05-07T19:40:57.4655665Z 2515086f3df4: Pull complete 2025-05-07T19:40:57.7911216Z 2c16fc959be4: Verifying Checksum 2025-05-07T19:40:57.7911536Z 2c16fc959be4: Download complete 2025-05-07T19:40:57.8411173Z 4f4fb700ef54: Download complete 2025-05-07T19:40:58.0658945Z 771eeddba98c: Verifying Checksum 2025-05-07T19:40:58.0659494Z 771eeddba98c: Download complete 2025-05-07T19:40:58.5742467Z 40bd9142b82e: Pull complete 2025-05-07T19:41:00.2016637Z 35915bf195e7: Verifying Checksum 2025-05-07T19:41:00.2016943Z 35915bf195e7: Download complete 2025-05-07T19:41:00.2557371Z 3d17c495d8c9: Download complete 2025-05-07T19:41:02.0909260Z 48558664882c: Verifying Checksum 2025-05-07T19:41:02.0909690Z 48558664882c: Download complete 2025-05-07T19:41:03.5876438Z 04ff279811b1: Pull complete 2025-05-07T19:41:03.7815267Z 1ecc152f7174: Pull complete 2025-05-07T19:41:04.3845623Z a5e9d3452d94: Pull complete 2025-05-07T19:41:04.4065785Z fd79b2bfc106: Pull complete 2025-05-07T19:41:04.5724789Z dfc73d088441: Pull complete 2025-05-07T19:41:04.5985350Z 049535df493b: Pull complete 2025-05-07T19:41:17.0524772Z dbe1603ccc47: Pull complete 2025-05-07T19:41:17.2046888Z aeaa64568845: Pull complete 2025-05-07T19:41:20.5327580Z 771eeddba98c: Pull complete 2025-05-07T19:41:20.6446323Z 926a41625fca: Pull complete 2025-05-07T19:41:20.7800012Z 6b8cb4269f2e: Pull complete 2025-05-07T19:41:20.8688743Z 5122e5ed1720: Pull complete 2025-05-07T19:41:20.9340467Z 797f43573cbe: Pull complete 2025-05-07T19:41:20.9898976Z 075eacd4adce: Pull complete 2025-05-07T19:41:21.1056717Z 5924a796f495: Pull complete 2025-05-07T19:41:21.1525668Z f6ec9d98ad10: Pull complete 2025-05-07T19:41:21.2239025Z ff887ee56f5e: Pull complete 2025-05-07T19:41:21.3160099Z 2b47deca1220: Pull complete 2025-05-07T19:41:35.9111878Z 48558664882c: Pull complete 2025-05-07T19:41:37.5985580Z 39220f950313: Pull complete 2025-05-07T19:41:39.3054618Z 2c16fc959be4: Pull complete 2025-05-07T19:41:39.4550314Z 4f4fb700ef54: Pull complete 2025-05-07T19:42:52.0554044Z dbb77404d40f: Verifying Checksum 2025-05-07T19:42:52.0556614Z dbb77404d40f: Download complete 2025-05-07T19:43:57.6412295Z dbb77404d40f: Pull complete 2025-05-07T19:43:59.4428489Z 35915bf195e7: Pull complete 2025-05-07T19:43:59.5784608Z 3d17c495d8c9: Pull complete 2025-05-07T19:43:59.6227418Z Digest: sha256:ecd25bd1885dafc31d7d12c37a98da9e7e5b0f2908cd766fa8517e742d71cd8c 2025-05-07T19:43:59.6345847Z Status: Downloaded newer image for pytorch/manylinux2_28-builder:cuda12.8 2025-05-07T19:43:59.6470863Z docker.io/pytorch/manylinux2_28-builder:cuda12.8 2025-05-07T19:43:59.6565898Z ##[command]/usr/bin/docker create --name c094c8ba304b42fabfe3662b3a2b7dff_pytorchmanylinux2_28buildercuda128_5a87e8 --label 1ad978 --workdir /__w/FBGEMM/FBGEMM --network github_network_647c08a59ff240119465636525774298 --gpus all -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/ec2-user/actions-runner/_work":"/__w" -v "/home/ec2-user/actions-runner/externals":"/__e":ro -v "/home/ec2-user/actions-runner/_work/_temp":"/__w/_temp" -v "/home/ec2-user/actions-runner/_work/_actions":"/__w/_actions" -v "/home/ec2-user/actions-runner/_work/_tool":"/__w/_tool" -v "/home/ec2-user/actions-runner/_work/_temp/_github_home":"/github/home" -v "/home/ec2-user/actions-runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" pytorch/manylinux2_28-builder:cuda12.8 "-f" "/dev/null" 2025-05-07T19:45:02.7089966Z d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 2025-05-07T19:45:02.7122399Z ##[command]/usr/bin/docker start d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 2025-05-07T19:45:03.2704131Z d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 2025-05-07T19:45:03.2730047Z ##[command]/usr/bin/docker ps --all --filter id=d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 --filter status=running --no-trunc --format "{{.ID}} {{.Status}}" 2025-05-07T19:45:03.2866765Z d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 Up Less than a second 2025-05-07T19:45:03.2896531Z ##[command]/usr/bin/docker inspect --format "{{range .Config.Env}}{{println .}}{{end}}" d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 2025-05-07T19:45:03.3041166Z GITHUB_ACTIONS=true 2025-05-07T19:45:03.3043975Z CI=true 2025-05-07T19:45:03.3044292Z HOME=/github/home 2025-05-07T19:45:03.3045131Z PATH=/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:45:03.3046036Z LC_ALL=en_US.UTF-8 2025-05-07T19:45:03.3047212Z LANG=en_US.UTF-8 2025-05-07T19:45:03.3047495Z LANGUAGE=en_US.UTF-8 2025-05-07T19:45:03.3047841Z SSL_CERT_FILE=/opt/_internal/certs.pem 2025-05-07T19:45:03.3048476Z LD_LIBRARY_PATH=/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: 2025-05-07T19:45:03.3068342Z ##[endgroup] 2025-05-07T19:45:03.3077284Z ##[group]Waiting for all services to be ready 2025-05-07T19:45:03.3078725Z ##[endgroup] 2025-05-07T19:45:03.3167691Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:03.3168049Z set -euxo pipefail 2025-05-07T19:45:03.3168348Z echo "::group::Cleanup debug output" 2025-05-07T19:45:03.3168673Z rm -rf "${GITHUB_WORKSPACE}" 2025-05-07T19:45:03.3168963Z mkdir -p "${GITHUB_WORKSPACE}" 2025-05-07T19:45:03.3169234Z  2025-05-07T19:45:03.3169435Z if [[ "x86_64" = "aarch64" ]]; then 2025-05-07T19:45:03.3169728Z  rm -rf "${RUNNER_TEMP}/*" 2025-05-07T19:45:03.3169977Z fi 2025-05-07T19:45:03.3170179Z echo "::endgroup::" 2025-05-07T19:45:03.3170502Z shell: bash -l {0} 2025-05-07T19:45:03.3170693Z env: 2025-05-07T19:45:03.3170878Z PYTHON_VERSION: 3.9 2025-05-07T19:45:03.3171095Z PACKAGE_TYPE: wheel 2025-05-07T19:45:03.3171331Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:03.3171562Z REF: 2025-05-07T19:45:03.3171738Z CU_VERSION: cu128 2025-05-07T19:45:03.3171950Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:03.3172228Z ARCH: x86_64 2025-05-07T19:45:03.3172425Z BUILD_TARGET: genai 2025-05-07T19:45:03.3172636Z ##[endgroup] 2025-05-07T19:45:03.5890673Z + echo '::group::Cleanup debug output' 2025-05-07T19:45:03.5897170Z ##[group]Cleanup debug output 2025-05-07T19:45:03.5897487Z + rm -rf /__w/FBGEMM/FBGEMM 2025-05-07T19:45:03.5905632Z + mkdir -p /__w/FBGEMM/FBGEMM 2025-05-07T19:45:03.5932638Z + [[ x86_64 = \a\a\r\c\h\6\4 ]] 2025-05-07T19:45:03.5932900Z + echo ::endgroup:: 2025-05-07T19:45:03.5933358Z ##[endgroup] 2025-05-07T19:45:03.6074391Z ##[group]Run actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 2025-05-07T19:45:03.6074786Z with: 2025-05-07T19:45:03.6074993Z repository: pytorch/test-infra 2025-05-07T19:45:03.6075241Z ref: main 2025-05-07T19:45:03.6075423Z path: test-infra 2025-05-07T19:45:03.6075751Z token: *** 2025-05-07T19:45:03.6075943Z ssh-strict: true 2025-05-07T19:45:03.6076163Z ssh-user: git 2025-05-07T19:45:03.6076403Z persist-credentials: true 2025-05-07T19:45:03.6076658Z clean: true 2025-05-07T19:45:03.6076875Z sparse-checkout-cone-mode: true 2025-05-07T19:45:03.6077138Z fetch-depth: 1 2025-05-07T19:45:03.6077343Z fetch-tags: false 2025-05-07T19:45:03.6077545Z show-progress: true 2025-05-07T19:45:03.6077757Z lfs: false 2025-05-07T19:45:03.6077941Z submodules: false 2025-05-07T19:45:03.6078160Z set-safe-directory: true 2025-05-07T19:45:03.6078378Z env: 2025-05-07T19:45:03.6078561Z PYTHON_VERSION: 3.9 2025-05-07T19:45:03.6078775Z PACKAGE_TYPE: wheel 2025-05-07T19:45:03.6079047Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:03.6079308Z REF: 2025-05-07T19:45:03.6079524Z CU_VERSION: cu128 2025-05-07T19:45:03.6079732Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:03.6080082Z ARCH: x86_64 2025-05-07T19:45:03.6080379Z BUILD_TARGET: genai 2025-05-07T19:45:03.6080652Z ##[endgroup] 2025-05-07T19:45:03.6121533Z ##[command]/usr/bin/docker exec d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 sh -c "cat /etc/*release | grep ^ID" 2025-05-07T19:45:03.9541284Z Syncing repository: pytorch/test-infra 2025-05-07T19:45:03.9541882Z ##[group]Getting Git version info 2025-05-07T19:45:03.9542230Z Working directory is '/__w/FBGEMM/FBGEMM/test-infra' 2025-05-07T19:45:03.9542757Z [command]/opt/conda/bin/git version 2025-05-07T19:45:03.9552329Z git version 2.40.1 2025-05-07T19:45:03.9577936Z ##[endgroup] 2025-05-07T19:45:03.9593813Z Temporarily overriding HOME='/__w/_temp/892e0dd9-4e3c-4536-9cc4-9c3482e0af40' before making global git config changes 2025-05-07T19:45:03.9594970Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T19:45:03.9599368Z [command]/opt/conda/bin/git config --global --add safe.directory /__w/FBGEMM/FBGEMM/test-infra 2025-05-07T19:45:03.9641709Z ##[group]Initializing the repository 2025-05-07T19:45:03.9645958Z [command]/opt/conda/bin/git init /__w/FBGEMM/FBGEMM/test-infra 2025-05-07T19:45:03.9688201Z hint: Using 'master' as the name for the initial branch. This default branch name 2025-05-07T19:45:03.9688995Z hint: is subject to change. To configure the initial branch name to use in all 2025-05-07T19:45:03.9689624Z hint: of your new repositories, which will suppress this warning, call: 2025-05-07T19:45:03.9689987Z hint: 2025-05-07T19:45:03.9690326Z hint: git config --global init.defaultBranch 2025-05-07T19:45:03.9690641Z hint: 2025-05-07T19:45:03.9690960Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2025-05-07T19:45:03.9691503Z hint: 'development'. The just-created branch can be renamed via this command: 2025-05-07T19:45:03.9691897Z hint: 2025-05-07T19:45:03.9692120Z hint: git branch -m 2025-05-07T19:45:03.9694458Z Initialized empty Git repository in /__w/FBGEMM/FBGEMM/test-infra/.git/ 2025-05-07T19:45:03.9706037Z [command]/opt/conda/bin/git remote add origin https://github.com/pytorch/test-infra 2025-05-07T19:45:03.9744506Z ##[endgroup] 2025-05-07T19:45:03.9745024Z ##[group]Disabling automatic garbage collection 2025-05-07T19:45:03.9748573Z [command]/opt/conda/bin/git config --local gc.auto 0 2025-05-07T19:45:03.9784167Z ##[endgroup] 2025-05-07T19:45:03.9784543Z ##[group]Setting up auth 2025-05-07T19:45:03.9791061Z [command]/opt/conda/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T19:45:03.9827766Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T19:45:04.0305198Z [command]/opt/conda/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T19:45:04.0341679Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T19:45:04.0817442Z [command]/opt/conda/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T19:45:04.0869971Z ##[endgroup] 2025-05-07T19:45:04.0870380Z ##[group]Fetching the repository 2025-05-07T19:45:04.0877795Z [command]/opt/conda/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +refs/heads/main*:refs/remotes/origin/main* +refs/tags/main*:refs/tags/main* 2025-05-07T19:45:04.6961090Z From https://github.com/pytorch/test-infra 2025-05-07T19:45:04.6961488Z * [new branch] main -> origin/main 2025-05-07T19:45:04.6991531Z ##[endgroup] 2025-05-07T19:45:04.6992044Z ##[group]Determining the checkout info 2025-05-07T19:45:04.6997756Z [command]/opt/conda/bin/git branch --list --remote origin/main 2025-05-07T19:45:04.7028889Z origin/main 2025-05-07T19:45:04.7035707Z ##[endgroup] 2025-05-07T19:45:04.7041022Z [command]/opt/conda/bin/git sparse-checkout disable 2025-05-07T19:45:04.7084352Z [command]/opt/conda/bin/git config --local --unset-all extensions.worktreeConfig 2025-05-07T19:45:04.7117752Z ##[group]Checking out the ref 2025-05-07T19:45:04.7121402Z [command]/opt/conda/bin/git checkout --progress --force -B main refs/remotes/origin/main 2025-05-07T19:45:04.8263450Z Switched to a new branch 'main' 2025-05-07T19:45:04.8263748Z branch 'main' set up to track 'origin/main'. 2025-05-07T19:45:04.8275031Z ##[endgroup] 2025-05-07T19:45:04.8322830Z [command]/opt/conda/bin/git log -1 --format=%H 2025-05-07T19:45:04.8356274Z 117fccdf5892ff9a958d2afb4b4b8b6e930d3187 2025-05-07T19:45:04.8605257Z Prepare all required actions 2025-05-07T19:45:04.8654667Z ##[group]Run ./test-infra/.github/actions/set-channel 2025-05-07T19:45:04.8654981Z env: 2025-05-07T19:45:04.8655175Z PYTHON_VERSION: 3.9 2025-05-07T19:45:04.8655396Z PACKAGE_TYPE: wheel 2025-05-07T19:45:04.8655633Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:04.8655866Z REF: 2025-05-07T19:45:04.8656049Z CU_VERSION: cu128 2025-05-07T19:45:04.8656262Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:04.8656501Z ARCH: x86_64 2025-05-07T19:45:04.8656732Z BUILD_TARGET: genai 2025-05-07T19:45:04.8656957Z ##[endgroup] 2025-05-07T19:45:04.8750982Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:04.8751274Z set -euxo pipefail 2025-05-07T19:45:04.8751575Z echo "CHANNEL=nightly" >> "${GITHUB_ENV}" 2025-05-07T19:45:04.8752111Z shell: bash --noprofile --norc -e -o pipefail {0} 2025-05-07T19:45:04.8752412Z env: 2025-05-07T19:45:04.8752594Z PYTHON_VERSION: 3.9 2025-05-07T19:45:04.8752854Z PACKAGE_TYPE: wheel 2025-05-07T19:45:04.8753085Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:04.8753324Z REF: 2025-05-07T19:45:04.8753499Z CU_VERSION: cu128 2025-05-07T19:45:04.8753720Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:04.8753947Z ARCH: x86_64 2025-05-07T19:45:04.8754149Z BUILD_TARGET: genai 2025-05-07T19:45:04.8754365Z ##[endgroup] 2025-05-07T19:45:05.0194537Z + echo CHANNEL=nightly 2025-05-07T19:45:05.0349972Z Prepare all required actions 2025-05-07T19:45:05.0350307Z Getting action download info 2025-05-07T19:45:05.1518490Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683) 2025-05-07T19:45:05.4155748Z Download action repository 'conda-incubator/setup-miniconda@v3.1.1' (SHA:505e6394dae86d6a5c7fbb6e3fb8938e3e863830) 2025-05-07T19:45:05.6553679Z ##[group]Run ./test-infra/.github/actions/setup-binary-builds 2025-05-07T19:45:05.6554021Z with: 2025-05-07T19:45:05.6554226Z repository: pytorch/FBGEMM 2025-05-07T19:45:05.6554473Z submodules: recursive 2025-05-07T19:45:05.6554709Z setup-miniconda: false 2025-05-07T19:45:05.6554940Z python-version: 3.9 2025-05-07T19:45:05.6555164Z cuda-version: cu128 2025-05-07T19:45:05.6555368Z arch: x86_64 2025-05-07T19:45:05.6555581Z upload-to-base-bucket: no 2025-05-07T19:45:05.6555805Z env: 2025-05-07T19:45:05.6555983Z PYTHON_VERSION: 3.9 2025-05-07T19:45:05.6556204Z PACKAGE_TYPE: wheel 2025-05-07T19:45:05.6556458Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:05.6556689Z REF: 2025-05-07T19:45:05.6556891Z CU_VERSION: cu128 2025-05-07T19:45:05.6557125Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:05.6557362Z ARCH: x86_64 2025-05-07T19:45:05.6557555Z BUILD_TARGET: genai 2025-05-07T19:45:05.6557764Z CHANNEL: nightly 2025-05-07T19:45:05.6557968Z PLATFORM: 2025-05-07T19:45:05.6558147Z ##[endgroup] 2025-05-07T19:45:05.6588455Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:05.6588744Z set -euxo pipefail 2025-05-07T19:45:05.6589009Z rm -rf "${REPOSITORY}" 2025-05-07T19:45:05.6589384Z shell: bash --noprofile --norc -e -o pipefail {0} 2025-05-07T19:45:05.6589686Z env: 2025-05-07T19:45:05.6589873Z PYTHON_VERSION: 3.9 2025-05-07T19:45:05.6590104Z PACKAGE_TYPE: wheel 2025-05-07T19:45:05.6590341Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:05.6590577Z REF: 2025-05-07T19:45:05.6590760Z CU_VERSION: cu128 2025-05-07T19:45:05.6590977Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:05.6591223Z ARCH: x86_64 2025-05-07T19:45:05.6591421Z BUILD_TARGET: genai 2025-05-07T19:45:05.6591640Z CHANNEL: nightly 2025-05-07T19:45:05.6591841Z PLATFORM: 2025-05-07T19:45:05.6592131Z ##[endgroup] 2025-05-07T19:45:05.8197080Z + rm -rf pytorch/FBGEMM 2025-05-07T19:45:05.8322668Z ##[group]Run actions/checkout@v4 2025-05-07T19:45:05.8322923Z with: 2025-05-07T19:45:05.8323131Z repository: pytorch/FBGEMM 2025-05-07T19:45:05.8323383Z submodules: recursive 2025-05-07T19:45:05.8323617Z path: pytorch/FBGEMM 2025-05-07T19:45:05.8323942Z token: *** 2025-05-07T19:45:05.8324143Z ssh-strict: true 2025-05-07T19:45:05.8324353Z ssh-user: git 2025-05-07T19:45:05.8324568Z persist-credentials: true 2025-05-07T19:45:05.8324814Z clean: true 2025-05-07T19:45:05.8325036Z sparse-checkout-cone-mode: true 2025-05-07T19:45:05.8325307Z fetch-depth: 1 2025-05-07T19:45:05.8325511Z fetch-tags: false 2025-05-07T19:45:05.8325728Z show-progress: true 2025-05-07T19:45:05.8325940Z lfs: false 2025-05-07T19:45:05.8326163Z set-safe-directory: true 2025-05-07T19:45:05.8326387Z env: 2025-05-07T19:45:05.8326580Z PYTHON_VERSION: 3.9 2025-05-07T19:45:05.8326825Z PACKAGE_TYPE: wheel 2025-05-07T19:45:05.8327097Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:05.8327336Z REF: 2025-05-07T19:45:05.8327515Z CU_VERSION: cu128 2025-05-07T19:45:05.8327741Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:05.8327970Z ARCH: x86_64 2025-05-07T19:45:05.8328173Z BUILD_TARGET: genai 2025-05-07T19:45:05.8328386Z CHANNEL: nightly 2025-05-07T19:45:05.8328590Z PLATFORM: 2025-05-07T19:45:05.8328779Z ##[endgroup] 2025-05-07T19:45:05.8333051Z ##[command]/usr/bin/docker exec d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 sh -c "cat /etc/*release | grep ^ID" 2025-05-07T19:45:06.0431035Z Syncing repository: pytorch/FBGEMM 2025-05-07T19:45:06.0437723Z ##[group]Getting Git version info 2025-05-07T19:45:06.0438094Z Working directory is '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM' 2025-05-07T19:45:06.0472282Z [command]/opt/conda/bin/git version 2025-05-07T19:45:06.0512286Z git version 2.40.1 2025-05-07T19:45:06.0536534Z ##[endgroup] 2025-05-07T19:45:06.0559916Z Temporarily overriding HOME='/__w/_temp/ff85448b-7baf-4205-9579-923ae6e6b17c' before making global git config changes 2025-05-07T19:45:06.0560664Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T19:45:06.0564785Z [command]/opt/conda/bin/git config --global --add safe.directory /__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:45:06.0602579Z ##[group]Initializing the repository 2025-05-07T19:45:06.0606505Z [command]/opt/conda/bin/git init /__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:45:06.0645086Z hint: Using 'master' as the name for the initial branch. This default branch name 2025-05-07T19:45:06.0645859Z hint: is subject to change. To configure the initial branch name to use in all 2025-05-07T19:45:06.0646597Z hint: of your new repositories, which will suppress this warning, call: 2025-05-07T19:45:06.0647187Z hint: 2025-05-07T19:45:06.0647541Z hint: git config --global init.defaultBranch 2025-05-07T19:45:06.0647967Z hint: 2025-05-07T19:45:06.0648405Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 2025-05-07T19:45:06.0649144Z hint: 'development'. The just-created branch can be renamed via this command: 2025-05-07T19:45:06.0649681Z hint: 2025-05-07T19:45:06.0649957Z hint: git branch -m 2025-05-07T19:45:06.0651343Z Initialized empty Git repository in /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/ 2025-05-07T19:45:06.0662387Z [command]/opt/conda/bin/git remote add origin https://github.com/pytorch/FBGEMM 2025-05-07T19:45:06.0697843Z ##[endgroup] 2025-05-07T19:45:06.0698229Z ##[group]Disabling automatic garbage collection 2025-05-07T19:45:06.0702055Z [command]/opt/conda/bin/git config --local gc.auto 0 2025-05-07T19:45:06.0735578Z ##[endgroup] 2025-05-07T19:45:06.0735957Z ##[group]Setting up auth 2025-05-07T19:45:06.0741414Z [command]/opt/conda/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T19:45:06.0775469Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T19:45:06.1236251Z [command]/opt/conda/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T19:45:06.1271448Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T19:45:06.1733891Z [command]/opt/conda/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T19:45:06.1785794Z ##[endgroup] 2025-05-07T19:45:06.1786183Z ##[group]Fetching the repository 2025-05-07T19:45:06.1794041Z [command]/opt/conda/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge 2025-05-07T19:45:06.4651938Z From https://github.com/pytorch/FBGEMM 2025-05-07T19:45:06.4652414Z * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge 2025-05-07T19:45:06.4691187Z ##[endgroup] 2025-05-07T19:45:06.4691561Z ##[group]Determining the checkout info 2025-05-07T19:45:06.4693665Z ##[endgroup] 2025-05-07T19:45:06.4698191Z [command]/opt/conda/bin/git sparse-checkout disable 2025-05-07T19:45:06.4741145Z [command]/opt/conda/bin/git config --local --unset-all extensions.worktreeConfig 2025-05-07T19:45:06.4774593Z ##[group]Checking out the ref 2025-05-07T19:45:06.4778007Z [command]/opt/conda/bin/git checkout --progress --force refs/remotes/pull/4066/merge 2025-05-07T19:45:06.5915036Z Note: switching to 'refs/remotes/pull/4066/merge'. 2025-05-07T19:45:06.5915413Z 2025-05-07T19:45:06.5915788Z You are in 'detached HEAD' state. You can look around, make experimental 2025-05-07T19:45:06.5917335Z changes and commit them, and you can discard any commits you make in this 2025-05-07T19:45:06.5918203Z state without impacting any branches by switching back to a branch. 2025-05-07T19:45:06.5918687Z 2025-05-07T19:45:06.5919007Z If you want to create a new branch to retain commits you create, you may 2025-05-07T19:45:06.5942539Z do so (now or later) by using -c with the switch command. Example: 2025-05-07T19:45:06.5942817Z 2025-05-07T19:45:06.5942974Z git switch -c 2025-05-07T19:45:06.5943155Z 2025-05-07T19:45:06.5943261Z Or undo this operation with: 2025-05-07T19:45:06.5943452Z 2025-05-07T19:45:06.5943534Z git switch - 2025-05-07T19:45:06.5943661Z 2025-05-07T19:45:06.5943903Z Turn off this advice by setting config variable advice.detachedHead to false 2025-05-07T19:45:06.5944211Z 2025-05-07T19:45:06.5944573Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4 2025-05-07T19:45:06.5945726Z ##[endgroup] 2025-05-07T19:45:06.5946162Z ##[group]Setting up auth for fetching submodules 2025-05-07T19:45:06.5946743Z [command]/opt/conda/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-05-07T19:45:06.5985630Z [command]/opt/conda/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-05-07T19:45:06.6023224Z [command]/opt/conda/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-05-07T19:45:06.6060172Z [command]/opt/conda/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-05-07T19:45:06.6093057Z ##[endgroup] 2025-05-07T19:45:06.6093406Z ##[group]Fetching submodules 2025-05-07T19:45:06.6097257Z [command]/opt/conda/bin/git submodule sync --recursive 2025-05-07T19:45:06.6579667Z [command]/opt/conda/bin/git -c protocol.version=2 submodule update --init --force --depth=1 --recursive 2025-05-07T19:45:06.7052733Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'external/asmjit' 2025-05-07T19:45:06.7054559Z Submodule 'external/composable_kernel' (https://github.com/jwfromm/composable_kernel.git) registered for path 'external/composable_kernel' 2025-05-07T19:45:06.7060027Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'external/cpuinfo' 2025-05-07T19:45:06.7065170Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'external/cutlass' 2025-05-07T19:45:06.7069936Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'external/googletest' 2025-05-07T19:45:06.7075359Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 'external/hipify_torch' 2025-05-07T19:45:06.7080535Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'external/json' 2025-05-07T19:45:06.7121799Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit'... 2025-05-07T19:45:07.0978729Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/composable_kernel'... 2025-05-07T19:45:08.5515087Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/cpuinfo'... 2025-05-07T19:45:08.9524871Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/cutlass'... 2025-05-07T19:45:11.2139517Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/googletest'... 2025-05-07T19:45:11.5665495Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/hipify_torch'... 2025-05-07T19:45:12.0594406Z Cloning into '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/json'... 2025-05-07T19:45:13.1976522Z From https://github.com/asmjit/asmjit 2025-05-07T19:45:13.1976973Z * branch e5d7c0bd5d9aec44d68830187138149e6a8c4e32 -> FETCH_HEAD 2025-05-07T19:45:13.2467863Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32' 2025-05-07T19:45:13.9001393Z From https://github.com/jwfromm/composable_kernel 2025-05-07T19:45:13.9001865Z * branch 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 -> FETCH_HEAD 2025-05-07T19:45:14.1778002Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406' 2025-05-07T19:45:14.8434125Z From https://github.com/pytorch/cpuinfo 2025-05-07T19:45:14.8434621Z * branch 6543fec09b2f04ac4a666882998b534afc9c1349 -> FETCH_HEAD 2025-05-07T19:45:14.9419936Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-05-07T19:45:16.0884555Z From https://github.com/jwfromm/cutlass 2025-05-07T19:45:16.0885309Z * branch 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 -> FETCH_HEAD 2025-05-07T19:45:16.7892846Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3' 2025-05-07T19:45:17.5034840Z From https://github.com/google/googletest 2025-05-07T19:45:17.5035294Z * branch f8d7d77c06936315286eb55f8de22cd23c188571 -> FETCH_HEAD 2025-05-07T19:45:17.5441062Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571' 2025-05-07T19:45:18.0905127Z From https://github.com/ROCmSoftwarePlatform/hipify_torch 2025-05-07T19:45:18.0905608Z * branch 420084499c7c1e1c2d801922f40df202eac5f3a0 -> FETCH_HEAD 2025-05-07T19:45:18.0993274Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0' 2025-05-07T19:45:19.8263279Z From https://github.com/nlohmann/json 2025-05-07T19:45:19.8263722Z * branch 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 -> FETCH_HEAD 2025-05-07T19:45:19.9380249Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-05-07T19:45:19.9429708Z [command]/opt/conda/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-05-07T19:45:19.9897859Z Entering 'external/asmjit' 2025-05-07T19:45:19.9968132Z Entering 'external/composable_kernel' 2025-05-07T19:45:20.0049785Z Entering 'external/cpuinfo' 2025-05-07T19:45:20.0121833Z Entering 'external/cutlass' 2025-05-07T19:45:20.0201779Z Entering 'external/googletest' 2025-05-07T19:45:20.0272512Z Entering 'external/hipify_torch' 2025-05-07T19:45:20.0345649Z Entering 'external/json' 2025-05-07T19:45:20.0430311Z ##[endgroup] 2025-05-07T19:45:20.0430754Z ##[group]Persisting credentials for submodules 2025-05-07T19:45:20.0437475Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-05-07T19:45:20.0900389Z Entering 'external/asmjit' 2025-05-07T19:45:20.0995833Z Entering 'external/composable_kernel' 2025-05-07T19:45:20.1100699Z Entering 'external/cpuinfo' 2025-05-07T19:45:20.1196544Z Entering 'external/cutlass' 2025-05-07T19:45:20.1300344Z Entering 'external/googletest' 2025-05-07T19:45:20.1396691Z Entering 'external/hipify_torch' 2025-05-07T19:45:20.1491830Z Entering 'external/json' 2025-05-07T19:45:20.1607711Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-05-07T19:45:20.2071263Z Entering 'external/asmjit' 2025-05-07T19:45:20.2160326Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/asmjit/config remote.origin.url 2025-05-07T19:45:20.2189747Z Entering 'external/composable_kernel' 2025-05-07T19:45:20.2279216Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url 2025-05-07T19:45:20.2314688Z Entering 'external/cpuinfo' 2025-05-07T19:45:20.2402710Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url 2025-05-07T19:45:20.2431232Z Entering 'external/cutlass' 2025-05-07T19:45:20.2524917Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/cutlass/config remote.origin.url 2025-05-07T19:45:20.2562322Z Entering 'external/googletest' 2025-05-07T19:45:20.2650073Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/googletest/config remote.origin.url 2025-05-07T19:45:20.2678081Z Entering 'external/hipify_torch' 2025-05-07T19:45:20.2767888Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url 2025-05-07T19:45:20.2796974Z Entering 'external/json' 2025-05-07T19:45:20.2890125Z file:/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.git/modules/external/json/config remote.origin.url 2025-05-07T19:45:20.3012572Z [command]/opt/conda/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-05-07T19:45:20.3475181Z Entering 'external/asmjit' 2025-05-07T19:45:20.3547410Z Entering 'external/composable_kernel' 2025-05-07T19:45:20.3634886Z Entering 'external/cpuinfo' 2025-05-07T19:45:20.3707158Z Entering 'external/cutlass' 2025-05-07T19:45:20.3787831Z Entering 'external/googletest' 2025-05-07T19:45:20.3860203Z Entering 'external/hipify_torch' 2025-05-07T19:45:20.3931688Z Entering 'external/json' 2025-05-07T19:45:20.4023919Z [command]/opt/conda/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-05-07T19:45:20.4491606Z Entering 'external/asmjit' 2025-05-07T19:45:20.4561025Z Entering 'external/composable_kernel' 2025-05-07T19:45:20.4636638Z Entering 'external/cpuinfo' 2025-05-07T19:45:20.4713938Z Entering 'external/cutlass' 2025-05-07T19:45:20.4793323Z Entering 'external/googletest' 2025-05-07T19:45:20.4863903Z Entering 'external/hipify_torch' 2025-05-07T19:45:20.4934570Z Entering 'external/json' 2025-05-07T19:45:20.5017855Z ##[endgroup] 2025-05-07T19:45:20.5065074Z [command]/opt/conda/bin/git log -1 --format=%H 2025-05-07T19:45:20.5098129Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T19:45:20.5252648Z ##[group]Run echo "ENV VARS" 2025-05-07T19:45:20.5252938Z echo "ENV VARS" 2025-05-07T19:45:20.5253186Z echo "${GITHUB_REF_NAME}" 2025-05-07T19:45:20.5253452Z echo "${GITHUB_REF}" 2025-05-07T19:45:20.5253706Z echo "${GITHUB_BASE_REF}" 2025-05-07T19:45:20.5253965Z  2025-05-07T19:45:20.5254173Z echo "GITHUB PROVIDED" 2025-05-07T19:45:20.5254426Z echo "4066/merge" 2025-05-07T19:45:20.5254657Z echo "" 2025-05-07T19:45:20.5255079Z echo "refs/pull/4066/merge" 2025-05-07T19:45:20.5255340Z echo "main" 2025-05-07T19:45:20.5255706Z shell: bash --noprofile --norc -e -o pipefail {0} 2025-05-07T19:45:20.5256006Z env: 2025-05-07T19:45:20.5256196Z PYTHON_VERSION: 3.9 2025-05-07T19:45:20.5256417Z PACKAGE_TYPE: wheel 2025-05-07T19:45:20.5256650Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:20.5256882Z REF: 2025-05-07T19:45:20.5257063Z CU_VERSION: cu128 2025-05-07T19:45:20.5257276Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:20.5257511Z ARCH: x86_64 2025-05-07T19:45:20.5257714Z BUILD_TARGET: genai 2025-05-07T19:45:20.5257921Z CHANNEL: nightly 2025-05-07T19:45:20.5258127Z PLATFORM: 2025-05-07T19:45:20.5258312Z ##[endgroup] 2025-05-07T19:45:20.6596577Z ENV VARS 2025-05-07T19:45:20.6596890Z 4066/merge 2025-05-07T19:45:20.6597146Z refs/pull/4066/merge 2025-05-07T19:45:20.6597425Z main 2025-05-07T19:45:20.6597653Z GITHUB PROVIDED 2025-05-07T19:45:20.6597843Z 4066/merge 2025-05-07T19:45:20.6597968Z 2025-05-07T19:45:20.6598075Z refs/pull/4066/merge 2025-05-07T19:45:20.6598350Z main 2025-05-07T19:45:20.6653049Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:20.6653360Z set -euxo pipefail 2025-05-07T19:45:20.6653842Z # Set artifact name here since github actions doesn't have string manipulation tools 2025-05-07T19:45:20.6654659Z # and "/" is not allowed in artifact names. //\//_ is to replace all forward slashes, 2025-05-07T19:45:20.6655090Z # not just the first one 2025-05-07T19:45:20.6655595Z echo "ARTIFACT_NAME=${REPOSITORY//\//_}_${REF//\//_}_${PYTHON_VERSION}_${CU_VERSION}_${ARCH}" >> "${GITHUB_ENV}" 2025-05-07T19:45:20.6656500Z shell: bash --noprofile --norc -e -o pipefail {0} 2025-05-07T19:45:20.6656802Z env: 2025-05-07T19:45:20.6657200Z PYTHON_VERSION: 3.9 2025-05-07T19:45:20.6657436Z PACKAGE_TYPE: wheel 2025-05-07T19:45:20.6657673Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:20.6657907Z REF: 2025-05-07T19:45:20.6658086Z CU_VERSION: cu128 2025-05-07T19:45:20.6658306Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:20.6658540Z ARCH: x86_64 2025-05-07T19:45:20.6658737Z BUILD_TARGET: genai 2025-05-07T19:45:20.6658952Z CHANNEL: nightly 2025-05-07T19:45:20.6659150Z PLATFORM: 2025-05-07T19:45:20.6659341Z ##[endgroup] 2025-05-07T19:45:20.7209392Z + echo ARTIFACT_NAME=pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:20.7271179Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:20.7271469Z set -euxo pipefail 2025-05-07T19:45:20.7271787Z conda info | grep -i 'base environment' 2025-05-07T19:45:20.7272306Z conda clean --all --quiet --yes 2025-05-07T19:45:20.7272766Z shell: bash -l {0} 2025-05-07T19:45:20.7272971Z env: 2025-05-07T19:45:20.7273161Z PYTHON_VERSION: 3.9 2025-05-07T19:45:20.7273405Z PACKAGE_TYPE: wheel 2025-05-07T19:45:20.7273637Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:20.7273878Z REF: 2025-05-07T19:45:20.7274058Z CU_VERSION: cu128 2025-05-07T19:45:20.7274290Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:20.7274534Z ARCH: x86_64 2025-05-07T19:45:20.7274743Z BUILD_TARGET: genai 2025-05-07T19:45:20.7274962Z CHANNEL: nightly 2025-05-07T19:45:20.7275217Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:20.7285648Z PLATFORM: 2025-05-07T19:45:20.7285877Z ##[endgroup] 2025-05-07T19:45:20.8500507Z + conda info 2025-05-07T19:45:20.8500769Z + grep -i 'base environment' 2025-05-07T19:45:21.5743445Z base environment : /opt/conda (writable) 2025-05-07T19:45:21.5745700Z + conda clean --all --quiet --yes 2025-05-07T19:45:22.3149809Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:22.3150107Z set -euxo pipefail 2025-05-07T19:45:22.3150410Z conda config --set channel_priority false 2025-05-07T19:45:22.3150791Z shell: bash -l {0} 2025-05-07T19:45:22.3151002Z env: 2025-05-07T19:45:22.3151190Z PYTHON_VERSION: 3.9 2025-05-07T19:45:22.3151411Z PACKAGE_TYPE: wheel 2025-05-07T19:45:22.3151646Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:22.3152283Z REF: 2025-05-07T19:45:22.3152466Z CU_VERSION: cu128 2025-05-07T19:45:22.3152679Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:22.3152915Z ARCH: x86_64 2025-05-07T19:45:22.3153112Z BUILD_TARGET: genai 2025-05-07T19:45:22.3153326Z CHANNEL: nightly 2025-05-07T19:45:22.3153580Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:22.3153872Z PLATFORM: 2025-05-07T19:45:22.3154067Z ##[endgroup] 2025-05-07T19:45:22.5058044Z + conda config --set channel_priority false 2025-05-07T19:45:22.7864444Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:22.7864759Z set -euxo pipefail 2025-05-07T19:45:22.7865125Z CONDA_ENV="${RUNNER_TEMP}/pytorch_pkg_helpers_${GITHUB_RUN_ID}" 2025-05-07T19:45:22.7865509Z conda create \ 2025-05-07T19:45:22.7865759Z  --yes --quiet \ 2025-05-07T19:45:22.7866022Z  --prefix "${CONDA_ENV}" \ 2025-05-07T19:45:22.7866295Z  "python=3.9" 2025-05-07T19:45:22.7866538Z CONDA_ENV="${CONDA_ENV}" 2025-05-07T19:45:22.7866840Z CONDA_RUN="conda run -p ${CONDA_ENV}" 2025-05-07T19:45:22.7867326Z ${CONDA_RUN} python -m pip install ${GITHUB_WORKSPACE}/test-infra/tools/pkg-helpers 2025-05-07T19:45:22.7867860Z BUILD_ENV_FILE="${RUNNER_TEMP}/build_env_${GITHUB_RUN_ID}" 2025-05-07T19:45:22.7868336Z ${CONDA_RUN} python -m pytorch_pkg_helpers > "${BUILD_ENV_FILE}" 2025-05-07T19:45:22.7868722Z cat "${BUILD_ENV_FILE}" 2025-05-07T19:45:22.7869067Z echo "BUILD_ENV_FILE=${BUILD_ENV_FILE}" >> "${GITHUB_ENV}" 2025-05-07T19:45:22.7869491Z shell: bash -l {0} 2025-05-07T19:45:22.7869685Z env: 2025-05-07T19:45:22.7869874Z PYTHON_VERSION: 3.9 2025-05-07T19:45:22.7870096Z PACKAGE_TYPE: wheel 2025-05-07T19:45:22.7870559Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:22.7870806Z REF: 2025-05-07T19:45:22.7870981Z CU_VERSION: cu128 2025-05-07T19:45:22.7871199Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:22.7871429Z ARCH: x86_64 2025-05-07T19:45:22.7871635Z BUILD_TARGET: genai 2025-05-07T19:45:22.7871843Z CHANNEL: nightly 2025-05-07T19:45:22.7872206Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:22.7872501Z PLATFORM: 2025-05-07T19:45:22.7872698Z ##[endgroup] 2025-05-07T19:45:22.9666815Z + CONDA_ENV=/__w/_temp/pytorch_pkg_helpers_14891846312 2025-05-07T19:45:22.9667350Z + conda create --yes --quiet --prefix /__w/_temp/pytorch_pkg_helpers_14891846312 python=3.9 2025-05-07T19:45:26.3703936Z Collecting package metadata (current_repodata.json): ...working... done 2025-05-07T19:45:26.4147392Z Solving environment: ...working... done 2025-05-07T19:45:28.8536198Z 2025-05-07T19:45:28.8536615Z ## Package Plan ## 2025-05-07T19:45:28.8536837Z 2025-05-07T19:45:28.8537122Z environment location: /__w/_temp/pytorch_pkg_helpers_14891846312 2025-05-07T19:45:28.8537461Z 2025-05-07T19:45:28.8537551Z added / updated specs: 2025-05-07T19:45:28.8537784Z - python=3.9 2025-05-07T19:45:28.8537919Z 2025-05-07T19:45:28.8537924Z 2025-05-07T19:45:28.8538060Z The following packages will be downloaded: 2025-05-07T19:45:28.8538271Z 2025-05-07T19:45:28.8538387Z package | build 2025-05-07T19:45:28.8538703Z ---------------------------|----------------- 2025-05-07T19:45:28.8539092Z _libgcc_mutex-0.1 | main 3 KB 2025-05-07T19:45:28.8539486Z _openmp_mutex-5.1 | 1_gnu 21 KB 2025-05-07T19:45:28.8539877Z ld_impl_linux-64-2.40 | h12ee557_0 710 KB 2025-05-07T19:45:28.8540257Z libffi-3.4.4 | h6a678d5_1 141 KB 2025-05-07T19:45:28.8540608Z pip-25.1 | pyhc872135_2 1.3 MB 2025-05-07T19:45:28.8540965Z python-3.9.21 | he870216_1 25.1 MB 2025-05-07T19:45:28.8541345Z setuptools-78.1.1 | py39h06a4308_0 1.7 MB 2025-05-07T19:45:28.8541728Z sqlite-3.45.3 | h5eee18b_0 1.2 MB 2025-05-07T19:45:28.8542410Z tk-8.6.14 | h39e8969_0 3.4 MB 2025-05-07T19:45:28.8542764Z tzdata-2025b | h04d1e81_0 116 KB 2025-05-07T19:45:28.8543130Z wheel-0.45.1 | py39h06a4308_0 114 KB 2025-05-07T19:45:28.8543470Z xz-5.6.4 | h5eee18b_1 567 KB 2025-05-07T19:45:28.8543809Z zlib-1.2.13 | h5eee18b_1 111 KB 2025-05-07T19:45:28.8544150Z ------------------------------------------------------------ 2025-05-07T19:45:28.8544473Z Total: 34.5 MB 2025-05-07T19:45:28.8544672Z 2025-05-07T19:45:28.8544803Z The following NEW packages will be INSTALLED: 2025-05-07T19:45:28.8545022Z 2025-05-07T19:45:28.8545219Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T19:45:28.8545653Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T19:45:28.8546149Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T19:45:28.8546695Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T19:45:28.8547157Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T19:45:28.8547569Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T19:45:28.8548005Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T19:45:28.8548455Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T19:45:28.8548898Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T19:45:28.8549793Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T19:45:28.8550992Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T19:45:28.8551407Z python pkgs/main/linux-64::python-3.9.21-he870216_1 2025-05-07T19:45:28.8551817Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T19:45:28.8552396Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0 2025-05-07T19:45:28.8552846Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T19:45:28.8553227Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T19:45:28.8553596Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T19:45:28.8553991Z wheel pkgs/main/linux-64::wheel-0.45.1-py39h06a4308_0 2025-05-07T19:45:28.8554376Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T19:45:28.8554779Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T19:45:28.8555024Z 2025-05-07T19:45:28.8555028Z 2025-05-07T19:45:29.0334501Z Preparing transaction: ...working... done 2025-05-07T19:45:30.4841471Z Verifying transaction: ...working... done 2025-05-07T19:45:33.5397445Z Executing transaction: ...working... done 2025-05-07T19:45:33.8439357Z + CONDA_ENV=/__w/_temp/pytorch_pkg_helpers_14891846312 2025-05-07T19:45:33.8439812Z + CONDA_RUN='conda run -p /__w/_temp/pytorch_pkg_helpers_14891846312' 2025-05-07T19:45:33.8440472Z + conda run -p /__w/_temp/pytorch_pkg_helpers_14891846312 python -m pip install /__w/FBGEMM/FBGEMM/test-infra/tools/pkg-helpers 2025-05-07T19:45:36.1547672Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2025-05-07T19:45:36.1549353Z 2025-05-07T19:45:36.1549538Z Processing /__w/FBGEMM/FBGEMM/test-infra/tools/pkg-helpers 2025-05-07T19:45:36.1549941Z Installing build dependencies: started 2025-05-07T19:45:36.1550327Z Installing build dependencies: finished with status 'done' 2025-05-07T19:45:36.1550715Z Getting requirements to build wheel: started 2025-05-07T19:45:36.1551124Z Getting requirements to build wheel: finished with status 'done' 2025-05-07T19:45:36.1551863Z Preparing metadata (pyproject.toml): started 2025-05-07T19:45:36.1552335Z Preparing metadata (pyproject.toml): finished with status 'done' 2025-05-07T19:45:36.1552803Z Building wheels for collected packages: pytorch-pkg-helpers 2025-05-07T19:45:36.1553259Z Building wheel for pytorch-pkg-helpers (pyproject.toml): started 2025-05-07T19:45:36.1553806Z Building wheel for pytorch-pkg-helpers (pyproject.toml): finished with status 'done' 2025-05-07T19:45:36.1554734Z Created wheel for pytorch-pkg-helpers: filename=pytorch_pkg_helpers-0.1.5-py3-none-any.whl size=7655 sha256=ac9455de5c8d4faabb3965576c35f8951a075a031d5da662b61e2ec94a287dd7 2025-05-07T19:45:36.1555770Z Stored in directory: /github/home/.cache/pip/wheels/98/9c/c4/592565e0f8c585aaecf739e5a9cf537367404a335919ec0833 2025-05-07T19:45:36.1556348Z Successfully built pytorch-pkg-helpers 2025-05-07T19:45:36.1556692Z Installing collected packages: pytorch-pkg-helpers 2025-05-07T19:45:36.1557085Z Successfully installed pytorch-pkg-helpers-0.1.5 2025-05-07T19:45:36.1557323Z 2025-05-07T19:45:36.1978564Z + BUILD_ENV_FILE=/__w/_temp/build_env_14891846312 2025-05-07T19:45:36.1979197Z + conda run -p /__w/_temp/pytorch_pkg_helpers_14891846312 python -m pytorch_pkg_helpers 2025-05-07T19:45:37.2274514Z + cat /__w/_temp/build_env_14891846312 2025-05-07T19:45:37.2302408Z # WARNING: Base version not found defaulting BUILD_VERSION to 0.1.0 2025-05-07T19:45:37.2302989Z export BUILD_VERSION='0.1.0.dev20250507+cu128' 2025-05-07T19:45:37.2303443Z export CUDA_HOME='/usr/local/cuda-12.8' 2025-05-07T19:45:37.2303795Z export CUDA_PATH='/usr/local/cuda-12.8' 2025-05-07T19:45:37.2304068Z export FORCE_CUDA=1 2025-05-07T19:45:37.2304680Z export PATH="/opt/python/cp39-cp39/bin:${PATH}" 2025-05-07T19:45:37.2305041Z export PATH="/usr/local/cuda-12.8/bin:${PATH}" 2025-05-07T19:45:37.2305617Z export PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T19:45:37.2306246Z export PYTORCH_S3_BUCKET_PATH='s3://pytorch/whl/nightly/cu128/' 2025-05-07T19:45:37.2306623Z export PYTORCH_VERSION_SUFFIX='+cu128' 2025-05-07T19:45:37.2307003Z export TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:45:37.2307452Z export VERSION_SUFFIX='+cu128' 2025-05-07T19:45:37.2307784Z export WHEEL_DIR='cu128/' 2025-05-07T19:45:37.2307967Z 2025-05-07T19:45:37.2308107Z + echo BUILD_ENV_FILE=/__w/_temp/build_env_14891846312 2025-05-07T19:45:37.2373876Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:37.2374211Z set -euxo pipefail 2025-05-07T19:45:37.2374569Z CONDA_ENV="${RUNNER_TEMP}/conda_environment_${GITHUB_RUN_ID}" 2025-05-07T19:45:37.2374970Z export CONDA_EXTRA_PARAM="" 2025-05-07T19:45:37.2375265Z  2025-05-07T19:45:37.2375500Z if [[ "${PYTHON_VERSION:-}" == "3.13t" ]]; then 2025-05-07T19:45:37.2375830Z  export PYTHON_VERSION=3.13 2025-05-07T19:45:37.2376230Z  export CONDA_EXTRA_PARAM=" python-freethreading -c conda-forge" 2025-05-07T19:45:37.2376628Z  2025-05-07T19:45:37.2376889Z  # downgrade conda version for python 3.13t install. 2025-05-07T19:45:37.2377337Z  # TODO: remove this once python 3.13t is fully supported on conda 2025-05-07T19:45:37.2377843Z  # Please see : https://github.com/conda/conda/issues/14554 2025-05-07T19:45:37.2378251Z  if [[ "$(uname)" == Darwin ]]; then 2025-05-07T19:45:37.2378607Z  # required to be able to downgrade on MacOS arm64 2025-05-07T19:45:37.2378954Z  conda install -y python=3.9 2025-05-07T19:45:37.2379335Z  if [[ -n "$(conda list | grep conda-anaconda-telemetry)" ]]; then 2025-05-07T19:45:37.2379826Z  conda uninstall -y conda-anaconda-telemetry conda-anaconda-tos 2025-05-07T19:45:37.2380213Z  fi 2025-05-07T19:45:37.2380410Z  fi 2025-05-07T19:45:37.2380708Z  conda install -y conda=24.7.1 conda-libmamba-solver=24.1.0 2025-05-07T19:45:37.2381266Z fi 2025-05-07T19:45:37.2381448Z  2025-05-07T19:45:37.2381643Z conda create \ 2025-05-07T19:45:37.2381883Z  --yes --quiet \ 2025-05-07T19:45:37.2382146Z  --prefix "${CONDA_ENV}" \ 2025-05-07T19:45:37.2382439Z  "python=${PYTHON_VERSION}" \ 2025-05-07T19:45:37.2382724Z  cmake=3.31.2 \ 2025-05-07T19:45:37.2382966Z  ninja=1.12.1 \ 2025-05-07T19:45:37.2383210Z  pkg-config=0.29 \ 2025-05-07T19:45:37.2383459Z  wheel=0.37 \ 2025-05-07T19:45:37.2383698Z  ${CONDA_EXTRA_PARAM} 2025-05-07T19:45:37.2383947Z  2025-05-07T19:45:37.2384198Z echo "CONDA_ENV=${CONDA_ENV}" >> "${GITHUB_ENV}" 2025-05-07T19:45:37.2384630Z echo "CONDA_RUN=conda run -p ${CONDA_ENV}" >> "${GITHUB_ENV}" 2025-05-07T19:45:37.2385422Z shell: bash -l {0} 2025-05-07T19:45:37.2385622Z env: 2025-05-07T19:45:37.2385825Z PYTHON_VERSION: 3.9 2025-05-07T19:45:37.2386048Z PACKAGE_TYPE: wheel 2025-05-07T19:45:37.2386285Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:37.2386522Z REF: 2025-05-07T19:45:37.2386707Z CU_VERSION: cu128 2025-05-07T19:45:37.2386929Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:37.2387164Z ARCH: x86_64 2025-05-07T19:45:37.2387369Z BUILD_TARGET: genai 2025-05-07T19:45:37.2387624Z CHANNEL: nightly 2025-05-07T19:45:37.2387875Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:37.2388220Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T19:45:37.2388522Z PLATFORM: 2025-05-07T19:45:37.2388727Z ##[endgroup] 2025-05-07T19:45:37.4653113Z + CONDA_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T19:45:37.4653486Z + export CONDA_EXTRA_PARAM= 2025-05-07T19:45:37.4654251Z + CONDA_EXTRA_PARAM= 2025-05-07T19:45:37.4654472Z + [[ 3.9 == \3\.\1\3\t ]] 2025-05-07T19:45:37.4655056Z + conda create --yes --quiet --prefix /__w/_temp/conda_environment_14891846312 python=3.9 cmake=3.31.2 ninja=1.12.1 pkg-config=0.29 wheel=0.37 2025-05-07T19:45:40.0469736Z Collecting package metadata (current_repodata.json): ...working... done 2025-05-07T19:45:40.1172485Z Solving environment: ...working... done 2025-05-07T19:45:41.8635066Z 2025-05-07T19:45:41.8635594Z ## Package Plan ## 2025-05-07T19:45:41.8635787Z 2025-05-07T19:45:41.8635985Z environment location: /__w/_temp/conda_environment_14891846312 2025-05-07T19:45:41.8636269Z 2025-05-07T19:45:41.8636369Z added / updated specs: 2025-05-07T19:45:41.8636598Z - cmake=3.31.2 2025-05-07T19:45:41.8636803Z - ninja=1.12.1 2025-05-07T19:45:41.8637069Z - pkg-config=0.29 2025-05-07T19:45:41.8637289Z - python=3.9 2025-05-07T19:45:41.8637481Z - wheel=0.37 2025-05-07T19:45:41.8637610Z 2025-05-07T19:45:41.8637643Z 2025-05-07T19:45:41.8637762Z The following packages will be downloaded: 2025-05-07T19:45:41.8637974Z 2025-05-07T19:45:41.8638090Z package | build 2025-05-07T19:45:41.8638394Z ---------------------------|----------------- 2025-05-07T19:45:41.8638737Z bzip2-1.0.8 | h5eee18b_6 262 KB 2025-05-07T19:45:41.8639099Z c-ares-1.19.1 | h5eee18b_0 118 KB 2025-05-07T19:45:41.8639458Z cmake-3.31.2 | h27e300b_0 20.9 MB 2025-05-07T19:45:41.8639815Z libcurl-8.12.1 | hc9e6f67_0 469 KB 2025-05-07T19:45:41.8640191Z libedit-3.1.20230828 | h5eee18b_0 179 KB 2025-05-07T19:45:41.8640569Z libnghttp2-1.57.0 | h2d74bed_0 674 KB 2025-05-07T19:45:41.8640945Z libssh2-1.11.1 | h251f7ec_0 308 KB 2025-05-07T19:45:41.8641306Z libuv-1.48.0 | h5eee18b_0 950 KB 2025-05-07T19:45:41.8641652Z lz4-c-1.9.4 | h6a678d5_1 156 KB 2025-05-07T19:45:41.8642033Z ninja-1.12.1 | h06a4308_0 8 KB 2025-05-07T19:45:41.8642394Z pkg-config-0.29.2 | h1bed415_8 245 KB 2025-05-07T19:45:41.8643041Z rhash-1.4.3 | hdbd6064_0 220 KB 2025-05-07T19:45:41.8643393Z wheel-0.37.1 | pyhd3eb1b0_0 33 KB 2025-05-07T19:45:41.8643737Z zstd-1.5.6 | hc292b87_0 664 KB 2025-05-07T19:45:41.8644074Z ------------------------------------------------------------ 2025-05-07T19:45:41.8644384Z Total: 25.1 MB 2025-05-07T19:45:41.8644586Z 2025-05-07T19:45:41.8644709Z The following NEW packages will be INSTALLED: 2025-05-07T19:45:41.8644925Z 2025-05-07T19:45:41.8645120Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main 2025-05-07T19:45:41.8645552Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu 2025-05-07T19:45:41.8645965Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6 2025-05-07T19:45:41.8646352Z c-ares pkgs/main/linux-64::c-ares-1.19.1-h5eee18b_0 2025-05-07T19:45:41.8646839Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0 2025-05-07T19:45:41.8647312Z cmake pkgs/main/linux-64::cmake-3.31.2-h27e300b_0 2025-05-07T19:45:41.8647693Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0 2025-05-07T19:45:41.8648064Z krb5 pkgs/main/linux-64::krb5-1.20.1-h143b758_1 2025-05-07T19:45:41.8648481Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0 2025-05-07T19:45:41.8649140Z libcurl pkgs/main/linux-64::libcurl-8.12.1-hc9e6f67_0 2025-05-07T19:45:41.8649568Z libedit pkgs/main/linux-64::libedit-3.1.20230828-h5eee18b_0 2025-05-07T19:45:41.8649985Z libev pkgs/main/linux-64::libev-4.33-h7f8727e_1 2025-05-07T19:45:41.8650550Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1 2025-05-07T19:45:41.8650964Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1 2025-05-07T19:45:41.8651393Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1 2025-05-07T19:45:41.8651826Z libnghttp2 pkgs/main/linux-64::libnghttp2-1.57.0-h2d74bed_0 2025-05-07T19:45:41.8652259Z libssh2 pkgs/main/linux-64::libssh2-1.11.1-h251f7ec_0 2025-05-07T19:45:41.8652759Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1 2025-05-07T19:45:41.8653269Z libuv pkgs/main/linux-64::libuv-1.48.0-h5eee18b_0 2025-05-07T19:45:41.8653652Z lz4-c pkgs/main/linux-64::lz4-c-1.9.4-h6a678d5_1 2025-05-07T19:45:41.8654026Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0 2025-05-07T19:45:41.8654417Z ninja pkgs/main/linux-64::ninja-1.12.1-h06a4308_0 2025-05-07T19:45:41.8654844Z ninja-base pkgs/main/linux-64::ninja-base-1.12.1-hdb19cb5_0 2025-05-07T19:45:41.8655271Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0 2025-05-07T19:45:41.8655664Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2 2025-05-07T19:45:41.8656074Z pkg-config pkgs/main/linux-64::pkg-config-0.29.2-h1bed415_8 2025-05-07T19:45:41.8656507Z python pkgs/main/linux-64::python-3.9.21-he870216_1 2025-05-07T19:45:41.8656920Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0 2025-05-07T19:45:41.8657311Z rhash pkgs/main/linux-64::rhash-1.4.3-hdbd6064_0 2025-05-07T19:45:41.8657749Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py39h06a4308_0 2025-05-07T19:45:41.8658193Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0 2025-05-07T19:45:41.8658567Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0 2025-05-07T19:45:41.8658929Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0 2025-05-07T19:45:41.8659329Z wheel pkgs/main/noarch::wheel-0.37.1-pyhd3eb1b0_0 2025-05-07T19:45:41.8659700Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1 2025-05-07T19:45:41.8660049Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1 2025-05-07T19:45:41.8660516Z zstd pkgs/main/linux-64::zstd-1.5.6-hc292b87_0 2025-05-07T19:45:41.8660743Z 2025-05-07T19:45:41.8660747Z 2025-05-07T19:45:42.1388046Z Preparing transaction: ...working... done 2025-05-07T19:45:44.0548848Z Verifying transaction: ...working... done 2025-05-07T19:45:48.2046536Z Executing transaction: ...working... done 2025-05-07T19:45:48.5213516Z + echo CONDA_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T19:45:48.5214368Z + echo 'CONDA_RUN=conda run -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:45:48.5324019Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:48.5324338Z set -euxo pipefail 2025-05-07T19:45:48.5324682Z cat ".github/scripts/nova_dir.bash" >> "${BUILD_ENV_FILE}" 2025-05-07T19:45:48.5325120Z shell: sh -e {0} 2025-05-07T19:45:48.5325312Z env: 2025-05-07T19:45:48.5325508Z PYTHON_VERSION: 3.9 2025-05-07T19:45:48.5325728Z PACKAGE_TYPE: wheel 2025-05-07T19:45:48.5325974Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:48.5326210Z REF: 2025-05-07T19:45:48.5326384Z CU_VERSION: cu128 2025-05-07T19:45:48.5326602Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:48.5326829Z ARCH: x86_64 2025-05-07T19:45:48.5327029Z BUILD_TARGET: genai 2025-05-07T19:45:48.5327240Z CHANNEL: nightly 2025-05-07T19:45:48.5327490Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:48.5327825Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T19:45:48.5328172Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T19:45:48.5328592Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:45:48.5328931Z ##[endgroup] 2025-05-07T19:45:48.6597885Z + cat .github/scripts/nova_dir.bash 2025-05-07T19:45:48.6700813Z ##[group]Run set -euxo pipefail 2025-05-07T19:45:48.6701131Z set -euxo pipefail 2025-05-07T19:45:48.6701435Z # shellcheck disable=SC1090 2025-05-07T19:45:48.6701743Z source "${BUILD_ENV_FILE}" 2025-05-07T19:45:48.6702043Z # shellcheck disable=SC2086 2025-05-07T19:45:48.6702337Z ${CONDA_RUN} ${PIP_INSTALL_TORCH}  2025-05-07T19:45:48.6702670Z shell: sh -e {0} 2025-05-07T19:45:48.6702872Z env: 2025-05-07T19:45:48.6703054Z PYTHON_VERSION: 3.9 2025-05-07T19:45:48.6703283Z PACKAGE_TYPE: wheel 2025-05-07T19:45:48.6703524Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:45:48.6703763Z REF: 2025-05-07T19:45:48.6703940Z CU_VERSION: cu128 2025-05-07T19:45:48.6704159Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:45:48.6704389Z ARCH: x86_64 2025-05-07T19:45:48.6704598Z BUILD_TARGET: genai 2025-05-07T19:45:48.6704809Z CHANNEL: nightly 2025-05-07T19:45:48.6705062Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:45:48.6705431Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T19:45:48.6705780Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T19:45:48.6706187Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:45:48.6706542Z ##[endgroup] 2025-05-07T19:45:48.7888113Z + source /__w/_temp/build_env_14891846312 2025-05-07T19:45:48.7888605Z ++ export BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T19:45:48.7889031Z ++ BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T19:45:48.7889437Z ++ export CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T19:45:48.7889822Z ++ CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T19:45:48.7890168Z ++ export CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T19:45:48.7890450Z ++ CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T19:45:48.7890710Z ++ export FORCE_CUDA=1 2025-05-07T19:45:48.7890923Z ++ FORCE_CUDA=1 2025-05-07T19:45:48.7891688Z ++ export PATH=/opt/python/cp39-cp39/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:45:48.7892880Z ++ PATH=/opt/python/cp39-cp39/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:45:48.7894435Z ++ export PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:45:48.7896103Z ++ PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:45:48.7897199Z ++ export 'PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T19:45:48.7898203Z ++ PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T19:45:48.7899014Z ++ export PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T19:45:48.7899573Z ++ PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T19:45:48.7899943Z ++ export PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T19:45:48.7900232Z ++ PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T19:45:48.7900617Z ++ export 'TORCH_CUDA_ARCH_LIST=5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:45:48.7901079Z ++ TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:45:48.7901601Z ++ export VERSION_SUFFIX=+cu128 2025-05-07T19:45:48.7901945Z ++ VERSION_SUFFIX=+cu128 2025-05-07T19:45:48.7902258Z ++ export WHEEL_DIR=cu128/ 2025-05-07T19:45:48.7902571Z ++ WHEEL_DIR=cu128/ 2025-05-07T19:45:48.7902811Z ++ FBGEMM_DIR=/__w/FBGEMM/FBGEMM 2025-05-07T19:45:48.7903122Z ++ export FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:45:48.7903480Z ++ FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:45:48.7903764Z +++ pwd 2025-05-07T19:45:48.7903962Z ++ working_dir=/__w/FBGEMM/FBGEMM 2025-05-07T19:45:48.7904607Z ++ [[ /__w/FBGEMM/FBGEMM == \/\_\_\w\/\F\B\G\E\M\M\/\F\B\G\E\M\M\/\p\y\t\o\r\c\h\/\F\B\G\E\M\M ]] 2025-05-07T19:45:48.7905007Z ++ export BUILD_FROM_NOVA=1 2025-05-07T19:45:48.7905240Z ++ BUILD_FROM_NOVA=1 2025-05-07T19:45:48.7905452Z ++ [[ cu128 == \c\u* ]] 2025-05-07T19:45:48.7905807Z ++ echo 'Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:45:48.7906246Z ++ [[ /__w/_temp/conda_environment_14891846312 != '' ]] 2025-05-07T19:45:48.7906732Z ++ export 'CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:45:48.7907335Z ++ CONDA_RUN='conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:45:48.7907896Z ++ echo 'conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:45:48.7908410Z Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0 2025-05-07T19:45:48.7908912Z conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:45:48.7909314Z ++ [[ cu128 == \c\u\1\2\8 ]] 2025-05-07T19:45:48.7909608Z ++ export 'TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T19:45:48.7909973Z ++ TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T19:45:48.7910341Z ++ echo 'Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T19:45:48.7911098Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128 2025-05-07T19:45:48.7911865Z Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T19:45:50.0078861Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128 2025-05-07T19:45:50.2054899Z Collecting torch 2025-05-07T19:45:50.2109940Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T19:45:50.3400390Z Collecting filelock (from torch) 2025-05-07T19:45:50.3455001Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T19:45:50.3909543Z Collecting typing-extensions>=4.10.0 (from torch) 2025-05-07T19:45:50.3956612Z Downloading https://download.pytorch.org/whl/nightly/typing_extensions-4.12.2-py3-none-any.whl (37 kB) 2025-05-07T19:45:50.4332685Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T19:45:50.4374449Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T19:45:50.4760645Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 167.3 MB/s eta 0:00:00 2025-05-07T19:45:50.5289172Z Collecting networkx (from torch) 2025-05-07T19:45:50.5340618Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-05-07T19:45:50.5479752Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 121.1 MB/s eta 0:00:00 2025-05-07T19:45:50.5874704Z Collecting jinja2 (from torch) 2025-05-07T19:45:50.5931159Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T19:45:50.6342932Z Collecting fsspec (from torch) 2025-05-07T19:45:50.6380366Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 2025-05-07T19:45:50.7009388Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T19:45:50.7072724Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T19:45:50.7520797Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T19:45:50.7573099Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T19:45:50.7838800Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T19:45:50.7926182Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T19:45:50.8220964Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T19:45:50.8262792Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T19:45:50.8565237Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T19:45:50.8608403Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T19:45:50.8910392Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T19:45:50.8950822Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T19:45:50.9262359Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T19:45:50.9326033Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T19:45:50.9754511Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T19:45:50.9814991Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T19:45:51.0108020Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T19:45:51.0148417Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T19:45:51.0463809Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T19:45:51.0517294Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T19:45:51.0869092Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T19:45:51.0913100Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T19:45:51.1209881Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T19:45:51.1250498Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T19:45:51.1531184Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T19:45:51.1590636Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T19:45:51.1875384Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 2025-05-07T19:45:51.1951950Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T19:45:51.2612357Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T19:45:51.2653900Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T19:45:51.2782972Z Requirement already satisfied: setuptools>=40.8.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from pytorch-triton==3.3.0+git96316ce5->torch) (78.1.1) 2025-05-07T19:45:51.3224084Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T19:45:51.3276193Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T19:45:51.3369933Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 41.4 MB/s eta 0:00:00 2025-05-07T19:45:51.3760854Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T19:45:51.3804359Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25 kB) 2025-05-07T19:45:51.3930715Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp39-cp39-manylinux_2_28_x86_64.whl (1047.1 MB) 2025-05-07T19:46:03.2565043Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 29.1 MB/s eta 0:00:00 2025-05-07T19:46:03.2615279Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T19:46:10.5479025Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 44.6 MB/s eta 0:00:00 2025-05-07T19:46:10.5537020Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T19:46:10.6126558Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 176.9 MB/s eta 0:00:00 2025-05-07T19:46:10.6176595Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T19:46:11.9041433Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 68.4 MB/s eta 0:00:00 2025-05-07T19:46:11.9119125Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T19:46:11.9723382Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 11.9 MB/s eta 0:00:00 2025-05-07T19:46:11.9768707Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T19:46:20.0524941Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 40.8 MB/s eta 0:00:00 2025-05-07T19:46:20.0575514Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T19:46:22.9102055Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 67.7 MB/s eta 0:00:00 2025-05-07T19:46:22.9178538Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T19:46:23.0028559Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 11.4 MB/s eta 0:00:00 2025-05-07T19:46:23.0089275Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 MB) 2025-05-07T19:46:23.8292646Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 77.6 MB/s eta 0:00:00 2025-05-07T19:46:23.8370006Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T19:46:27.7068838Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 67.3 MB/s eta 0:00:00 2025-05-07T19:46:27.7142217Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T19:46:32.5034316Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 56.2 MB/s eta 0:00:00 2025-05-07T19:46:32.5089961Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T19:46:35.4720624Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 52.9 MB/s eta 0:00:00 2025-05-07T19:46:35.4779568Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T19:46:38.9060737Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 58.7 MB/s eta 0:00:00 2025-05-07T19:46:38.9169303Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T19:46:39.5550786Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 61.4 MB/s eta 0:00:00 2025-05-07T19:46:39.5604082Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T19:46:39.6092728Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.4 MB) 2025-05-07T19:46:42.4600503Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.4/153.4 MB 53.8 MB/s eta 0:00:00 2025-05-07T19:46:45.4407002Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, typing-extensions, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T19:47:31.3692021Z 2025-05-07T19:47:31.3740808Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 typing-extensions-4.12.2 2025-05-07T19:47:31.3744510Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2025-05-07T19:47:32.0786283Z Prepare all required actions 2025-05-07T19:47:32.0786792Z Getting action download info 2025-05-07T19:47:32.2341563Z Download action repository 'actions/cache@v3' (SHA:2f8e54208210a422b2efd51efaa6bd6d7ca8920f) 2025-05-07T19:47:33.6942344Z ##[group]Run ./test-infra/.github/actions/run-script-with-cache 2025-05-07T19:47:33.6942691Z with: 2025-05-07T19:47:33.6942896Z repository: pytorch/FBGEMM 2025-05-07T19:47:33.6943180Z script: ../.github/scripts/nova_prescript.bash 2025-05-07T19:47:33.6943494Z is_windows: disabled 2025-05-07T19:47:33.6943696Z env: 2025-05-07T19:47:33.6943881Z PYTHON_VERSION: 3.9 2025-05-07T19:47:33.6944097Z PACKAGE_TYPE: wheel 2025-05-07T19:47:33.6944338Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:47:33.6944569Z REF: 2025-05-07T19:47:33.6944756Z CU_VERSION: cu128 2025-05-07T19:47:33.6944972Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:47:33.6945197Z ARCH: x86_64 2025-05-07T19:47:33.6945391Z BUILD_TARGET: genai 2025-05-07T19:47:33.6945621Z CHANNEL: nightly 2025-05-07T19:47:33.6945867Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:47:33.6946203Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T19:47:33.6946535Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:33.6946940Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:33.6947280Z ##[endgroup] 2025-05-07T19:47:33.7055196Z ##[group]Run echo "today=$(/bin/date -u '+%Y%m%d')d" >> "${GITHUB_OUTPUT}" 2025-05-07T19:47:33.7055694Z echo "today=$(/bin/date -u '+%Y%m%d')d" >> "${GITHUB_OUTPUT}" 2025-05-07T19:47:33.7056148Z shell: bash --noprofile --norc -e -o pipefail {0} 2025-05-07T19:47:33.7056444Z env: 2025-05-07T19:47:33.7056628Z PYTHON_VERSION: 3.9 2025-05-07T19:47:33.7056853Z PACKAGE_TYPE: wheel 2025-05-07T19:47:33.7057085Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:47:33.7057321Z REF: 2025-05-07T19:47:33.7057494Z CU_VERSION: cu128 2025-05-07T19:47:33.7057711Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:47:33.7057946Z ARCH: x86_64 2025-05-07T19:47:33.7058149Z BUILD_TARGET: genai 2025-05-07T19:47:33.7058359Z CHANNEL: nightly 2025-05-07T19:47:33.7058602Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:47:33.7058939Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T19:47:33.7059277Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:33.7059675Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:33.7060022Z ##[endgroup] 2025-05-07T19:47:33.8715616Z ##[group]Run # Windows scripts needs cleanup on audio and vision, todo remove this once resolved 2025-05-07T19:47:33.8716290Z # Windows scripts needs cleanup on audio and vision, todo remove this once resolved 2025-05-07T19:47:33.8716765Z if [[ disabled == 'disabled' ]]; then 2025-05-07T19:47:33.8717071Z  set -euxo pipefail 2025-05-07T19:47:33.8717317Z fi 2025-05-07T19:47:33.8717525Z source "${BUILD_ENV_FILE}" 2025-05-07T19:47:33.8717796Z  2025-05-07T19:47:33.8717996Z if [[ ! -f ${SCRIPT} ]]; then 2025-05-07T19:47:33.8718434Z  echo "::error::Specified script file (${SCRIPT}) not found, not going execute it" 2025-05-07T19:47:33.8718852Z  exit 1 2025-05-07T19:47:33.8719047Z else 2025-05-07T19:47:33.8719273Z  if [[ ${SCRIPT} == *.bat ]]; then 2025-05-07T19:47:33.8719562Z  ${CONDA_RUN} ${SCRIPT} 2025-05-07T19:47:33.8719824Z  else 2025-05-07T19:47:33.8720210Z  ${CONDA_RUN} bash ${SCRIPT} 2025-05-07T19:47:33.8720474Z  fi 2025-05-07T19:47:33.8720655Z fi 2025-05-07T19:47:33.8720906Z shell: bash -l {0} 2025-05-07T19:47:33.8721101Z env: 2025-05-07T19:47:33.8721288Z PYTHON_VERSION: 3.9 2025-05-07T19:47:33.8721513Z PACKAGE_TYPE: wheel 2025-05-07T19:47:33.8721740Z REPOSITORY: pytorch/FBGEMM 2025-05-07T19:47:33.8721975Z REF: 2025-05-07T19:47:33.8722146Z CU_VERSION: cu128 2025-05-07T19:47:33.8722551Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T19:47:33.8722783Z ARCH: x86_64 2025-05-07T19:47:33.8722980Z BUILD_TARGET: genai 2025-05-07T19:47:33.8723188Z CHANNEL: nightly 2025-05-07T19:47:33.8723436Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:47:33.8723770Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T19:47:33.8724116Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:33.8724513Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:33.8724904Z SCRIPT: ../.github/scripts/nova_prescript.bash 2025-05-07T19:47:33.8725199Z ##[endgroup] 2025-05-07T19:47:34.0750325Z + source /__w/_temp/build_env_14891846312 2025-05-07T19:47:34.0750810Z ++ export BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T19:47:34.0751243Z ++ BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T19:47:34.0751642Z ++ export CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T19:47:34.0751928Z ++ CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T19:47:34.0752341Z ++ export CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T19:47:34.0752631Z ++ CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T19:47:34.0752890Z ++ export FORCE_CUDA=1 2025-05-07T19:47:34.0753105Z ++ FORCE_CUDA=1 2025-05-07T19:47:34.0753893Z ++ export PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:47:34.0755268Z ++ PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:47:34.0756725Z ++ export PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:47:34.0758263Z ++ PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T19:47:34.0759430Z ++ export 'PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T19:47:34.0760181Z ++ PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T19:47:34.0760771Z ++ export PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T19:47:34.0761220Z ++ PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T19:47:34.0761573Z ++ export PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T19:47:34.0761919Z ++ PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T19:47:34.0762271Z ++ export 'TORCH_CUDA_ARCH_LIST=5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:47:34.0762707Z ++ TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:47:34.0763047Z ++ export VERSION_SUFFIX=+cu128 2025-05-07T19:47:34.0763308Z ++ VERSION_SUFFIX=+cu128 2025-05-07T19:47:34.0763557Z ++ export WHEEL_DIR=cu128/ 2025-05-07T19:47:34.0763796Z ++ WHEEL_DIR=cu128/ 2025-05-07T19:47:34.0764010Z ++ FBGEMM_DIR=/__w/FBGEMM/FBGEMM 2025-05-07T19:47:34.0764318Z ++ export FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:47:34.0764670Z ++ FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:47:34.0764953Z +++ pwd 2025-05-07T19:47:34.0765175Z ++ working_dir=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:47:34.0765716Z ++ [[ /__w/FBGEMM/FBGEMM/pytorch/FBGEMM == \/\_\_\w\/\F\B\G\E\M\M\/\F\B\G\E\M\M\/\p\y\t\o\r\c\h\/\F\B\G\E\M\M ]] 2025-05-07T19:47:34.0766504Z ++ cd fbgemm_gpu 2025-05-07T19:47:34.0766709Z ++ export BUILD_FROM_NOVA=1 2025-05-07T19:47:34.0766948Z ++ BUILD_FROM_NOVA=1 2025-05-07T19:47:34.0767174Z ++ [[ cu128 == \c\u* ]] 2025-05-07T19:47:34.0767554Z ++ echo 'Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T19:47:34.0767997Z ++ [[ /__w/_temp/conda_environment_14891846312 != '' ]] 2025-05-07T19:47:34.0768672Z ++ export 'CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:47:34.0769297Z ++ CONDA_RUN='conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:47:34.0769854Z ++ echo 'conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T19:47:34.0770377Z Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0 2025-05-07T19:47:34.0770872Z conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:34.0771275Z ++ [[ cu128 == \c\u\1\2\8 ]] 2025-05-07T19:47:34.0771574Z ++ export 'TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T19:47:34.0771934Z ++ TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T19:47:34.0772311Z ++ echo 'Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T19:47:34.0772685Z + [[ ! -f ../.github/scripts/nova_prescript.bash ]] 2025-05-07T19:47:34.0773047Z + [[ ../.github/scripts/nova_prescript.bash == *.bat ]] 2025-05-07T19:47:34.0773631Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 bash ../.github/scripts/nova_prescript.bash 2025-05-07T19:47:34.0774228Z Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T19:47:35.0371215Z [NOVA] Current working directory: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:47:35.5984183Z ################################################################################ 2025-05-07T19:47:35.5984558Z Environment Variables: 2025-05-07T19:47:35.6011647Z CONDA_SHLVL=2 2025-05-07T19:47:35.6012547Z LD_LIBRARY_PATH=/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: 2025-05-07T19:47:35.6013570Z CONDA_EXE=/opt/conda/bin/conda 2025-05-07T19:47:35.6014055Z KERN_NAME=Linux 2025-05-07T19:47:35.6014424Z ARCH=x86_64 2025-05-07T19:47:35.6014889Z MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD 2025-05-07T19:47:35.6015487Z LANG=en_US.UTF-8 2025-05-07T19:47:35.6015898Z HISTCONTROL=ignoredups 2025-05-07T19:47:35.6016354Z HOSTNAME=d075a29d5e53 2025-05-07T19:47:35.6016795Z GITHUB_REF_NAME=4066/merge 2025-05-07T19:47:35.6017347Z OLDPWD=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:47:35.6018038Z GITHUB_API_URL=https://api.github.com 2025-05-07T19:47:35.6018372Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T19:47:35.6018644Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T19:47:35.6018905Z CHANNEL=nightly 2025-05-07T19:47:35.6019320Z GITHUB_STEP_SUMMARY=/__w/_temp/_runner_file_commands/step_summary_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6019817Z CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T19:47:35.6020241Z GITHUB_ACTION_PATH=/__w/FBGEMM/FBGEMM/./test-infra/.github/actions/run-script-with-cache 2025-05-07T19:47:35.6020685Z GITHUB_RUN_ATTEMPT=1 2025-05-07T19:47:35.6020902Z MACHINE_NAME_LC=x86_64 2025-05-07T19:47:35.6021132Z RUNNER_TOOL_CACHE=/__w/_tool 2025-05-07T19:47:35.6021528Z CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6022011Z CONDA_PREFIX=/__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6022343Z BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T19:47:35.6022645Z CONDA_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6022962Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T19:47:35.6023203Z MACHINE_NAME=x86_64 2025-05-07T19:47:35.6023425Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T19:47:35.6023667Z GITHUB_ACTIONS=true 2025-05-07T19:47:35.6023869Z KERN_NAME_LC=linux 2025-05-07T19:47:35.6024327Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/build_wheels_genai_linux_x86.yml@refs/pull/4066/merge 2025-05-07T19:47:35.6025109Z _CE_M= 2025-05-07T19:47:35.6025299Z which_declare=declare -f 2025-05-07T19:47:35.6025514Z CI=true 2025-05-07T19:47:35.6025750Z MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl 2025-05-07T19:47:35.6026051Z USER=root 2025-05-07T19:47:35.6026256Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T19:47:35.6026521Z CONDA_PREFIX_1=/opt/conda 2025-05-07T19:47:35.6026744Z CU_VERSION=cu128 2025-05-07T19:47:35.6027094Z GITHUB_ACTOR=q10 2025-05-07T19:47:35.6027299Z GITHUB_ACTION_REF= 2025-05-07T19:47:35.6027501Z GITHUB_ACTION=__self_3 2025-05-07T19:47:35.6027759Z GITHUB_REF_PROTECTED=false 2025-05-07T19:47:35.6028009Z WHEEL_DIR=cu128/ 2025-05-07T19:47:35.6028855Z *** 2025-05-07T19:47:35.6029045Z VERSION_SUFFIX=+cu128 2025-05-07T19:47:35.6029255Z HOME=/github/home 2025-05-07T19:47:35.6029483Z CONDA_PYTHON_EXE=/opt/conda/bin/python 2025-05-07T19:47:35.6029933Z GITHUB_STATE=/__w/_temp/_runner_file_commands/save_state_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6030418Z ARTIFACT_NAME=pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:47:35.6030718Z CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T19:47:35.6030977Z GITHUB_ACTION_REPOSITORY= 2025-05-07T19:47:35.6031206Z GITHUB_REF_TYPE=branch 2025-05-07T19:47:35.6031433Z RUNNER_TEMP=/__w/_temp 2025-05-07T19:47:35.6031648Z BUILD_FROM_NOVA=1 2025-05-07T19:47:35.6031852Z GITHUB_RETENTION_DAYS=90 2025-05-07T19:47:35.6032199Z REF= 2025-05-07T19:47:35.6032539Z GITHUB_ENV=/__w/_temp/_runner_file_commands/set_env_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6032979Z SSL_CERT_FILE=/opt/_internal/certs.pem 2025-05-07T19:47:35.6033256Z RUNNER_WORKSPACE=/__w/FBGEMM 2025-05-07T19:47:35.6033512Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T19:47:35.6033801Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T19:47:35.6034098Z _CE_CONDA= 2025-05-07T19:47:35.6034292Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T19:47:35.6034543Z GITHUB_RUN_ID=14891846312 2025-05-07T19:47:35.6034813Z FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:47:35.6035138Z BUILD_ENV_FILE=/__w/_temp/build_env_14891846312 2025-05-07T19:47:35.6035425Z RUNNER_ARCH=X64 2025-05-07T19:47:35.6035647Z GITHUB_SERVER_URL=https://github.com 2025-05-07T19:47:35.6036153Z PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128 2025-05-07T19:47:35.6036645Z REPOSITORY=pytorch/FBGEMM 2025-05-07T19:47:35.6036885Z GITHUB_ACTOR_ID=255046 2025-05-07T19:47:35.6037097Z LOADEDMODULES= 2025-05-07T19:47:35.6037308Z UPLOAD_TO_BASE_BUCKET=no 2025-05-07T19:47:35.6037576Z GITHUB_EVENT_PATH=/github/workflow/event.json 2025-05-07T19:47:35.6038008Z CONDA_PROMPT_MODIFIER=(/__w/_temp/conda_environment_14891846312) 2025-05-07T19:47:35.6038376Z PLATFORM_NAME=Linux-x86_64 2025-05-07T19:47:35.6038607Z PACKAGE_TYPE=wheel 2025-05-07T19:47:35.6038870Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T19:47:35.6039180Z MAIL=/var/spool/mail/root 2025-05-07T19:47:35.6039405Z RUNNER_OS=Linux 2025-05-07T19:47:35.6039600Z GITHUB_BASE_REF=main 2025-05-07T19:47:35.6039823Z FORCE_CUDA=1 2025-05-07T19:47:35.6040050Z TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T19:47:35.6040496Z GITHUB_PATH=/__w/_temp/_runner_file_commands/add_path_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6040917Z GITHUB_JOB=build 2025-05-07T19:47:35.6041112Z BUILD_TARGET=genai 2025-05-07T19:47:35.6041330Z RUNNER_NAME=i-0ca7df191fe9703a4 2025-05-07T19:47:35.6041576Z PYTHON_VERSION=3.9 2025-05-07T19:47:35.6041794Z CONDA_ROOT=/opt/conda 2025-05-07T19:47:35.6042192Z GITHUB_OUTPUT=/__w/_temp/_runner_file_commands/set_output_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6042641Z PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T19:47:35.6042879Z SHLVL=3 2025-05-07T19:47:35.6043057Z LANGUAGE=en_US.UTF-8 2025-05-07T19:47:35.6043291Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T19:47:35.6043538Z MANPATH=: 2025-05-07T19:47:35.6043765Z SCRIPT=../.github/scripts/nova_prescript.bash 2025-05-07T19:47:35.6044168Z GITHUB_EVENT_NAME=pull_request 2025-05-07T19:47:35.6044673Z MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles 2025-05-07T19:47:35.6045175Z LOGNAME=root 2025-05-07T19:47:35.6045598Z MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2 2025-05-07T19:47:35.6046083Z GITHUB_RUN_NUMBER=1266 2025-05-07T19:47:35.6046347Z GITHUB_WORKFLOW=Build FBGEMM GenAI x86 Linux Wheels 2025-05-07T19:47:35.6047657Z PATH=/__w/_temp/conda_environment_14891846312/bin:/opt/conda/condabin:/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/sbin 2025-05-07T19:47:35.6061228Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T19:47:35.6061640Z DEBUGINFOD_URLS=https://debuginfod.centos.org/ 2025-05-07T19:47:35.6061984Z GITHUB_WORKSPACE=/__w/FBGEMM/FBGEMM 2025-05-07T19:47:35.6062282Z MODULESHOME=/usr/share/Modules 2025-05-07T19:47:35.6062606Z CONDA_DEFAULT_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6062956Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T19:47:35.6063192Z HISTSIZE=1000 2025-05-07T19:47:35.6063464Z PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T19:47:35.6063808Z LESSOPEN=||/usr/bin/lesspipe.sh %s 2025-05-07T19:47:35.6064086Z BASH_FUNC_which%%=() { ( alias; 2025-05-07T19:47:35.6064575Z eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@ 2025-05-07T19:47:35.6065062Z } 2025-05-07T19:47:35.6065269Z BASH_FUNC_module%%=() { unset _mlshdbg; 2025-05-07T19:47:35.6065592Z if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then 2025-05-07T19:47:35.6065901Z case "$-" in 2025-05-07T19:47:35.6066086Z *v*x*) 2025-05-07T19:47:35.6066265Z set +vx; 2025-05-07T19:47:35.6066441Z _mlshdbg='vx' 2025-05-07T19:47:35.6066628Z ;; 2025-05-07T19:47:35.6066790Z *v*) 2025-05-07T19:47:35.6066968Z set +v; 2025-05-07T19:47:35.6067137Z _mlshdbg='v' 2025-05-07T19:47:35.6067322Z ;; 2025-05-07T19:47:35.6067505Z *x*) 2025-05-07T19:47:35.6067690Z set +x; 2025-05-07T19:47:35.6067893Z _mlshdbg='x' 2025-05-07T19:47:35.6068093Z ;; 2025-05-07T19:47:35.6068252Z *) 2025-05-07T19:47:35.6068421Z _mlshdbg='' 2025-05-07T19:47:35.6068599Z ;; 2025-05-07T19:47:35.6068766Z esac; 2025-05-07T19:47:35.6069080Z fi; 2025-05-07T19:47:35.6069265Z unset _mlre _mlIFS; 2025-05-07T19:47:35.6069503Z if [ -n "${IFS+x}" ]; then 2025-05-07T19:47:35.6069726Z _mlIFS=$IFS; 2025-05-07T19:47:35.6069912Z fi; 2025-05-07T19:47:35.6070076Z IFS=' '; 2025-05-07T19:47:35.6070289Z for _mlv in ${MODULES_RUN_QUARANTINE:-}; 2025-05-07T19:47:35.6070552Z do 2025-05-07T19:47:35.6070808Z if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then 2025-05-07T19:47:35.6071169Z if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then 2025-05-07T19:47:35.6071529Z _mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' "; 2025-05-07T19:47:35.6071844Z fi; 2025-05-07T19:47:35.6072162Z _mlrv="MODULES_RUNENV_${_mlv}"; 2025-05-07T19:47:35.6072476Z _mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' "; 2025-05-07T19:47:35.6072777Z fi; 2025-05-07T19:47:35.6072943Z done; 2025-05-07T19:47:35.6073138Z if [ -n "${_mlre:-}" ]; then 2025-05-07T19:47:35.6073562Z eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`; 2025-05-07T19:47:35.6073985Z else 2025-05-07T19:47:35.6074309Z eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`; 2025-05-07T19:47:35.6074680Z fi; 2025-05-07T19:47:35.6074852Z _mlstatus=$?; 2025-05-07T19:47:35.6075055Z if [ -n "${_mlIFS+x}" ]; then 2025-05-07T19:47:35.6075281Z IFS=$_mlIFS; 2025-05-07T19:47:35.6075450Z else 2025-05-07T19:47:35.6075613Z unset IFS; 2025-05-07T19:47:35.6075778Z fi; 2025-05-07T19:47:35.6075952Z unset _mlre _mlv _mlrv _mlIFS; 2025-05-07T19:47:35.6076197Z if [ -n "${_mlshdbg:-}" ]; then 2025-05-07T19:47:35.6076561Z set -$_mlshdbg; 2025-05-07T19:47:35.6076739Z fi; 2025-05-07T19:47:35.6076899Z unset _mlshdbg; 2025-05-07T19:47:35.6077087Z return $_mlstatus 2025-05-07T19:47:35.6077271Z } 2025-05-07T19:47:35.6077480Z BASH_FUNC_switchml%%=() { typeset swfound=1; 2025-05-07T19:47:35.6077810Z if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then 2025-05-07T19:47:35.6078112Z typeset swname='main'; 2025-05-07T19:47:35.6078393Z if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then 2025-05-07T19:47:35.6078703Z typeset swfound=0; 2025-05-07T19:47:35.6079041Z unset MODULES_USE_COMPAT_VERSION; 2025-05-07T19:47:35.6079288Z fi; 2025-05-07T19:47:35.6079443Z else 2025-05-07T19:47:35.6079631Z typeset swname='compatibility'; 2025-05-07T19:47:35.6079957Z if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then 2025-05-07T19:47:35.6080285Z typeset swfound=0; 2025-05-07T19:47:35.6080506Z MODULES_USE_COMPAT_VERSION=1; 2025-05-07T19:47:35.6080766Z export MODULES_USE_COMPAT_VERSION; 2025-05-07T19:47:35.6081018Z fi; 2025-05-07T19:47:35.6081183Z fi; 2025-05-07T19:47:35.6081370Z if [ $swfound -eq 0 ]; then 2025-05-07T19:47:35.6081647Z echo "Switching to Modules $swname version"; 2025-05-07T19:47:35.6081953Z source /usr/share/Modules/init/bash; 2025-05-07T19:47:35.6082210Z else 2025-05-07T19:47:35.6082493Z echo "Cannot switch to Modules $swname version, command not found"; 2025-05-07T19:47:35.6082846Z return 1; 2025-05-07T19:47:35.6083014Z fi 2025-05-07T19:47:35.6083178Z } 2025-05-07T19:47:35.6083428Z BASH_FUNC_scl%%=() { if [ "$1" = "load" -o "$1" = "unload" ]; then 2025-05-07T19:47:35.6083763Z eval "module $@"; 2025-05-07T19:47:35.6083951Z else 2025-05-07T19:47:35.6084129Z /usr/bin/scl "$@"; 2025-05-07T19:47:35.6084324Z fi 2025-05-07T19:47:35.6084488Z } 2025-05-07T19:47:35.6084664Z BASH_FUNC_ml%%=() { module ml "$@" 2025-05-07T19:47:35.6085226Z } 2025-05-07T19:47:35.6085401Z _=/usr/bin/printenv 2025-05-07T19:47:35.6085639Z ################################################################################ 2025-05-07T19:47:35.6085958Z ################################################################################ 2025-05-07T19:47:35.6086239Z # Print System Info 2025-05-07T19:47:35.6086433Z # 2025-05-07T19:47:35.6086635Z # [2025-05-07T19:47:35.603Z] + print_system_info 2025-05-07T19:47:35.6086946Z ################################################################################ 2025-05-07T19:47:35.6087147Z 2025-05-07T19:47:35.6087253Z ################################################################################ 2025-05-07T19:47:35.6087571Z [INFO] Printing environment variables ... 2025-05-07T19:47:35.6087890Z + printenv 2025-05-07T19:47:35.6088001Z 2025-05-07T19:47:35.6088132Z CONDA_SHLVL=2 2025-05-07T19:47:35.6088510Z LD_LIBRARY_PATH=/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: 2025-05-07T19:47:35.6088974Z CONDA_EXE=/opt/conda/bin/conda 2025-05-07T19:47:35.6089210Z KERN_NAME=Linux 2025-05-07T19:47:35.6089407Z ARCH=x86_64 2025-05-07T19:47:35.6089647Z MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD 2025-05-07T19:47:35.6089951Z LANG=en_US.UTF-8 2025-05-07T19:47:35.6090160Z HISTCONTROL=ignoredups 2025-05-07T19:47:35.6090386Z HOSTNAME=d075a29d5e53 2025-05-07T19:47:35.6090610Z GITHUB_REF_NAME=4066/merge 2025-05-07T19:47:35.6090887Z OLDPWD=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:47:35.6091231Z GITHUB_API_URL=https://api.github.com 2025-05-07T19:47:35.6091517Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T19:47:35.6091781Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T19:47:35.6092037Z CHANNEL=nightly 2025-05-07T19:47:35.6092454Z GITHUB_STEP_SUMMARY=/__w/_temp/_runner_file_commands/step_summary_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6092937Z CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T19:47:35.6093360Z GITHUB_ACTION_PATH=/__w/FBGEMM/FBGEMM/./test-infra/.github/actions/run-script-with-cache 2025-05-07T19:47:35.6093801Z GITHUB_RUN_ATTEMPT=1 2025-05-07T19:47:35.6094015Z MACHINE_NAME_LC=x86_64 2025-05-07T19:47:35.6094245Z RUNNER_TOOL_CACHE=/__w/_tool 2025-05-07T19:47:35.6094804Z CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6095283Z CONDA_PREFIX=/__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6095603Z BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T19:47:35.6095911Z CONDA_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6096230Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T19:47:35.6096474Z MACHINE_NAME=x86_64 2025-05-07T19:47:35.6096703Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T19:47:35.6097072Z GITHUB_ACTIONS=true 2025-05-07T19:47:35.6097280Z KERN_NAME_LC=linux 2025-05-07T19:47:35.6097740Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/build_wheels_genai_linux_x86.yml@refs/pull/4066/merge 2025-05-07T19:47:35.6098245Z _CE_M= 2025-05-07T19:47:35.6098428Z which_declare=declare -f 2025-05-07T19:47:35.6098645Z CI=true 2025-05-07T19:47:35.6098878Z MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl 2025-05-07T19:47:35.6099185Z USER=root 2025-05-07T19:47:35.6099387Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T19:47:35.6099659Z CONDA_PREFIX_1=/opt/conda 2025-05-07T19:47:35.6099886Z CU_VERSION=cu128 2025-05-07T19:47:35.6100079Z GITHUB_ACTOR=q10 2025-05-07T19:47:35.6100281Z GITHUB_ACTION_REF= 2025-05-07T19:47:35.6100483Z GITHUB_ACTION=__self_3 2025-05-07T19:47:35.6100713Z GITHUB_REF_PROTECTED=false 2025-05-07T19:47:35.6100940Z WHEEL_DIR=cu128/ 2025-05-07T19:47:35.6101239Z *** 2025-05-07T19:47:35.6101416Z VERSION_SUFFIX=+cu128 2025-05-07T19:47:35.6101634Z HOME=/github/home 2025-05-07T19:47:35.6101868Z CONDA_PYTHON_EXE=/opt/conda/bin/python 2025-05-07T19:47:35.6102319Z GITHUB_STATE=/__w/_temp/_runner_file_commands/save_state_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6102798Z ARTIFACT_NAME=pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T19:47:35.6103103Z CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T19:47:35.6103359Z GITHUB_ACTION_REPOSITORY= 2025-05-07T19:47:35.6103592Z GITHUB_REF_TYPE=branch 2025-05-07T19:47:35.6103815Z RUNNER_TEMP=/__w/_temp 2025-05-07T19:47:35.6104024Z BUILD_FROM_NOVA=1 2025-05-07T19:47:35.6104240Z GITHUB_RETENTION_DAYS=90 2025-05-07T19:47:35.6104463Z REF= 2025-05-07T19:47:35.6104790Z GITHUB_ENV=/__w/_temp/_runner_file_commands/set_env_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6105227Z SSL_CERT_FILE=/opt/_internal/certs.pem 2025-05-07T19:47:35.6105508Z RUNNER_WORKSPACE=/__w/FBGEMM 2025-05-07T19:47:35.6105762Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T19:47:35.6106047Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T19:47:35.6106347Z _CE_CONDA= 2025-05-07T19:47:35.6106546Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T19:47:35.6106795Z GITHUB_RUN_ID=14891846312 2025-05-07T19:47:35.6107053Z FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T19:47:35.6107381Z BUILD_ENV_FILE=/__w/_temp/build_env_14891846312 2025-05-07T19:47:35.6107667Z RUNNER_ARCH=X64 2025-05-07T19:47:35.6107886Z GITHUB_SERVER_URL=https://github.com 2025-05-07T19:47:35.6108391Z PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128 2025-05-07T19:47:35.6108887Z REPOSITORY=pytorch/FBGEMM 2025-05-07T19:47:35.6109124Z GITHUB_ACTOR_ID=255046 2025-05-07T19:47:35.6109334Z LOADEDMODULES= 2025-05-07T19:47:35.6109540Z UPLOAD_TO_BASE_BUCKET=no 2025-05-07T19:47:35.6109802Z GITHUB_EVENT_PATH=/github/workflow/event.json 2025-05-07T19:47:35.6110198Z CONDA_PROMPT_MODIFIER=(/__w/_temp/conda_environment_14891846312) 2025-05-07T19:47:35.6110577Z PLATFORM_NAME=Linux-x86_64 2025-05-07T19:47:35.6110813Z PACKAGE_TYPE=wheel 2025-05-07T19:47:35.6111083Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T19:47:35.6111397Z MAIL=/var/spool/mail/root 2025-05-07T19:47:35.6111621Z RUNNER_OS=Linux 2025-05-07T19:47:35.6111815Z GITHUB_BASE_REF=main 2025-05-07T19:47:35.6112119Z FORCE_CUDA=1 2025-05-07T19:47:35.6112336Z TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T19:47:35.6112789Z GITHUB_PATH=/__w/_temp/_runner_file_commands/add_path_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6113203Z GITHUB_JOB=build 2025-05-07T19:47:35.6113501Z BUILD_TARGET=genai 2025-05-07T19:47:35.6113720Z RUNNER_NAME=i-0ca7df191fe9703a4 2025-05-07T19:47:35.6113963Z PYTHON_VERSION=3.9 2025-05-07T19:47:35.6114172Z CONDA_ROOT=/opt/conda 2025-05-07T19:47:35.6114564Z GITHUB_OUTPUT=/__w/_temp/_runner_file_commands/set_output_0285c72b-f326-4d23-9fd5-b829840b5651 2025-05-07T19:47:35.6115022Z PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T19:47:35.6115255Z SHLVL=3 2025-05-07T19:47:35.6115439Z LANGUAGE=en_US.UTF-8 2025-05-07T19:47:35.6115673Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T19:47:35.6116015Z MANPATH=: 2025-05-07T19:47:35.6116239Z SCRIPT=../.github/scripts/nova_prescript.bash 2025-05-07T19:47:35.6116553Z GITHUB_EVENT_NAME=pull_request 2025-05-07T19:47:35.6117060Z MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles 2025-05-07T19:47:35.6117567Z LOGNAME=root 2025-05-07T19:47:35.6117998Z MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2 2025-05-07T19:47:35.6118489Z GITHUB_RUN_NUMBER=1266 2025-05-07T19:47:35.6118766Z GITHUB_WORKFLOW=Build FBGEMM GenAI x86 Linux Wheels 2025-05-07T19:47:35.6119996Z PATH=/__w/_temp/conda_environment_14891846312/bin:/opt/conda/condabin:/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/sbin 2025-05-07T19:47:35.6121256Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T19:47:35.6121669Z DEBUGINFOD_URLS=https://debuginfod.centos.org/ 2025-05-07T19:47:35.6121995Z GITHUB_WORKSPACE=/__w/FBGEMM/FBGEMM 2025-05-07T19:47:35.6122276Z MODULESHOME=/usr/share/Modules 2025-05-07T19:47:35.6122590Z CONDA_DEFAULT_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T19:47:35.6122935Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T19:47:35.6123165Z HISTSIZE=1000 2025-05-07T19:47:35.6123430Z PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T19:47:35.6123766Z LESSOPEN=||/usr/bin/lesspipe.sh %s 2025-05-07T19:47:35.6124044Z BASH_FUNC_which%%=() { ( alias; 2025-05-07T19:47:35.6124529Z eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@ 2025-05-07T19:47:35.6125003Z } 2025-05-07T19:47:35.6125200Z BASH_FUNC_module%%=() { unset _mlshdbg; 2025-05-07T19:47:35.6125518Z if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then 2025-05-07T19:47:35.6125813Z case "$-" in 2025-05-07T19:47:35.6125997Z *v*x*) 2025-05-07T19:47:35.6126172Z set +vx; 2025-05-07T19:47:35.6126343Z _mlshdbg='vx' 2025-05-07T19:47:35.6126530Z ;; 2025-05-07T19:47:35.6126688Z *v*) 2025-05-07T19:47:35.6126860Z set +v; 2025-05-07T19:47:35.6127035Z _mlshdbg='v' 2025-05-07T19:47:35.6127212Z ;; 2025-05-07T19:47:35.6127374Z *x*) 2025-05-07T19:47:35.6127535Z set +x; 2025-05-07T19:47:35.6127733Z _mlshdbg='x' 2025-05-07T19:47:35.6127929Z ;; 2025-05-07T19:47:35.6128084Z *) 2025-05-07T19:47:35.6128243Z _mlshdbg='' 2025-05-07T19:47:35.6128424Z ;; 2025-05-07T19:47:35.6128590Z esac; 2025-05-07T19:47:35.6128758Z fi; 2025-05-07T19:47:35.6128924Z unset _mlre _mlIFS; 2025-05-07T19:47:35.6129151Z if [ -n "${IFS+x}" ]; then 2025-05-07T19:47:35.6129374Z _mlIFS=$IFS; 2025-05-07T19:47:35.6129556Z fi; 2025-05-07T19:47:35.6129715Z IFS=' '; 2025-05-07T19:47:35.6129927Z for _mlv in ${MODULES_RUN_QUARANTINE:-}; 2025-05-07T19:47:35.6130198Z do 2025-05-07T19:47:35.6130450Z if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then 2025-05-07T19:47:35.6130819Z if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then 2025-05-07T19:47:35.6131161Z _mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' "; 2025-05-07T19:47:35.6131477Z fi; 2025-05-07T19:47:35.6131664Z _mlrv="MODULES_RUNENV_${_mlv}"; 2025-05-07T19:47:35.6131970Z _mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' "; 2025-05-07T19:47:35.6132264Z fi; 2025-05-07T19:47:35.6132434Z done; 2025-05-07T19:47:35.6132621Z if [ -n "${_mlre:-}" ]; then 2025-05-07T19:47:35.6133029Z eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`; 2025-05-07T19:47:35.6133549Z else 2025-05-07T19:47:35.6133858Z eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`; 2025-05-07T19:47:35.6134234Z fi; 2025-05-07T19:47:35.6134398Z _mlstatus=$?; 2025-05-07T19:47:35.6134602Z if [ -n "${_mlIFS+x}" ]; then 2025-05-07T19:47:35.6134832Z IFS=$_mlIFS; 2025-05-07T19:47:35.6135014Z else 2025-05-07T19:47:35.6135180Z unset IFS; 2025-05-07T19:47:35.6135355Z fi; 2025-05-07T19:47:35.6135634Z unset _mlre _mlv _mlrv _mlIFS; 2025-05-07T19:47:35.6135886Z if [ -n "${_mlshdbg:-}" ]; then 2025-05-07T19:47:35.6136127Z set -$_mlshdbg; 2025-05-07T19:47:35.6136310Z fi; 2025-05-07T19:47:35.6136479Z unset _mlshdbg; 2025-05-07T19:47:35.6136672Z return $_mlstatus 2025-05-07T19:47:35.6136863Z } 2025-05-07T19:47:35.6137071Z BASH_FUNC_switchml%%=() { typeset swfound=1; 2025-05-07T19:47:35.6137410Z if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then 2025-05-07T19:47:35.6137716Z typeset swname='main'; 2025-05-07T19:47:35.6138006Z if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then 2025-05-07T19:47:35.6138336Z typeset swfound=0; 2025-05-07T19:47:35.6138567Z unset MODULES_USE_COMPAT_VERSION; 2025-05-07T19:47:35.6138814Z fi; 2025-05-07T19:47:35.6138972Z else 2025-05-07T19:47:35.6139171Z typeset swname='compatibility'; 2025-05-07T19:47:35.6139498Z if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then 2025-05-07T19:47:35.6139835Z typeset swfound=0; 2025-05-07T19:47:35.6140060Z MODULES_USE_COMPAT_VERSION=1; 2025-05-07T19:47:35.6140325Z export MODULES_USE_COMPAT_VERSION; 2025-05-07T19:47:35.6140569Z fi; 2025-05-07T19:47:35.6140731Z fi; 2025-05-07T19:47:35.6140910Z if [ $swfound -eq 0 ]; then 2025-05-07T19:47:35.6141188Z echo "Switching to Modules $swname version"; 2025-05-07T19:47:35.6141503Z source /usr/share/Modules/init/bash; 2025-05-07T19:47:35.6141760Z else 2025-05-07T19:47:35.6142049Z echo "Cannot switch to Modules $swname version, command not found"; 2025-05-07T19:47:35.6142404Z return 1; 2025-05-07T19:47:35.6142580Z fi 2025-05-07T19:47:35.6142738Z } 2025-05-07T19:47:35.6142992Z BASH_FUNC_scl%%=() { if [ "$1" = "load" -o "$1" = "unload" ]; then 2025-05-07T19:47:35.6143318Z eval "module $@"; 2025-05-07T19:47:35.6143518Z else 2025-05-07T19:47:35.6143691Z /usr/bin/scl "$@"; 2025-05-07T19:47:35.6143889Z fi 2025-05-07T19:47:35.6144051Z } 2025-05-07T19:47:35.6144239Z BASH_FUNC_ml%%=() { module ml "$@" 2025-05-07T19:47:35.6144487Z } 2025-05-07T19:47:35.6144662Z _=/usr/bin/printenv 2025-05-07T19:47:35.6144794Z 2025-05-07T19:47:35.6144907Z ################################################################################ 2025-05-07T19:47:35.6145207Z [INFO] Print ldd version ... 2025-05-07T19:47:35.6145452Z + ldd --version 2025-05-07T19:47:35.6145573Z 2025-05-07T19:47:35.6145667Z ldd (GNU libc) 2.28 2025-05-07T19:47:35.6145929Z Copyright (C) 2018 Free Software Foundation, Inc. 2025-05-07T19:47:35.6146352Z This is free software; see the source for copying conditions. There is NO 2025-05-07T19:47:35.6146879Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T19:47:35.6147324Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T19:47:35.6147541Z 2025-05-07T19:47:35.6147645Z ################################################################################ 2025-05-07T19:47:35.6147941Z [INFO] Print CPU info ... 2025-05-07T19:47:35.6148162Z + nproc 2025-05-07T19:47:35.6148270Z 2025-05-07T19:47:35.6148339Z 16 2025-05-07T19:47:35.6148434Z 2025-05-07T19:47:35.6148508Z + lscpu 2025-05-07T19:47:35.6148613Z 2025-05-07T19:47:35.6205258Z Architecture: x86_64 2025-05-07T19:47:35.6205538Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T19:47:35.6205807Z Byte Order: Little Endian 2025-05-07T19:47:35.6206062Z CPU(s): 16 2025-05-07T19:47:35.6206285Z On-line CPU(s) list: 0-15 2025-05-07T19:47:35.6206520Z Thread(s) per core: 2 2025-05-07T19:47:35.6206737Z Core(s) per socket: 8 2025-05-07T19:47:35.6206967Z Socket(s): 1 2025-05-07T19:47:35.6207380Z NUMA node(s): 1 2025-05-07T19:47:35.6207609Z Vendor ID: AuthenticAMD 2025-05-07T19:47:35.6207868Z CPU family: 23 2025-05-07T19:47:35.6208084Z Model: 49 2025-05-07T19:47:35.6208313Z Model name: AMD EPYC 7R32 2025-05-07T19:47:35.6208562Z Stepping: 0 2025-05-07T19:47:35.6208784Z CPU MHz: 3294.303 2025-05-07T19:47:35.6209021Z BogoMIPS: 5599.99 2025-05-07T19:47:35.6209266Z Hypervisor vendor: KVM 2025-05-07T19:47:35.6209634Z Virtualization type: full 2025-05-07T19:47:35.6209880Z L1d cache: 32K 2025-05-07T19:47:35.6210101Z L1i cache: 32K 2025-05-07T19:47:35.6210318Z L2 cache: 512K 2025-05-07T19:47:35.6210545Z L3 cache: 16384K 2025-05-07T19:47:35.6210775Z NUMA node0 CPU(s): 0-15 2025-05-07T19:47:35.6212798Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6214794Z 2025-05-07T19:47:35.6214880Z + cat /proc/cpuinfo 2025-05-07T19:47:35.6215012Z 2025-05-07T19:47:35.6233410Z processor : 0 2025-05-07T19:47:35.6234469Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6234905Z cpu family : 23 2025-05-07T19:47:35.6235179Z model : 49 2025-05-07T19:47:35.6235692Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6235984Z stepping : 0 2025-05-07T19:47:35.6236268Z microcode : 0x830107f 2025-05-07T19:47:35.6236556Z cpu MHz : 3293.309 2025-05-07T19:47:35.6236843Z cache size : 512 KB 2025-05-07T19:47:35.6237154Z physical id : 0 2025-05-07T19:47:35.6237415Z siblings : 16 2025-05-07T19:47:35.6237601Z core id : 0 2025-05-07T19:47:35.6237794Z cpu cores : 8 2025-05-07T19:47:35.6237982Z apicid : 0 2025-05-07T19:47:35.6238227Z initial apicid : 0 2025-05-07T19:47:35.6238457Z fpu : yes 2025-05-07T19:47:35.6238695Z fpu_exception : yes 2025-05-07T19:47:35.6238994Z cpuid level : 13 2025-05-07T19:47:35.6239236Z wp : yes 2025-05-07T19:47:35.6241310Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6243659Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6244145Z bogomips : 5599.99 2025-05-07T19:47:35.6244363Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6244592Z clflush size : 64 2025-05-07T19:47:35.6244796Z cache_alignment : 64 2025-05-07T19:47:35.6245063Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6245371Z power management: 2025-05-07T19:47:35.6245511Z 2025-05-07T19:47:35.6245598Z processor : 1 2025-05-07T19:47:35.6245802Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6246034Z cpu family : 23 2025-05-07T19:47:35.6246228Z model : 49 2025-05-07T19:47:35.6246431Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6246656Z stepping : 0 2025-05-07T19:47:35.6246859Z microcode : 0x830107f 2025-05-07T19:47:35.6247079Z cpu MHz : 3299.062 2025-05-07T19:47:35.6247279Z cache size : 512 KB 2025-05-07T19:47:35.6247491Z physical id : 0 2025-05-07T19:47:35.6247684Z siblings : 16 2025-05-07T19:47:35.6248108Z core id : 1 2025-05-07T19:47:35.6248290Z cpu cores : 8 2025-05-07T19:47:35.6248489Z apicid : 2 2025-05-07T19:47:35.6248678Z initial apicid : 2 2025-05-07T19:47:35.6248885Z fpu : yes 2025-05-07T19:47:35.6249068Z fpu_exception : yes 2025-05-07T19:47:35.6249286Z cpuid level : 13 2025-05-07T19:47:35.6249478Z wp : yes 2025-05-07T19:47:35.6260995Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6263319Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6263824Z bogomips : 5599.99 2025-05-07T19:47:35.6264045Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6264285Z clflush size : 64 2025-05-07T19:47:35.6264493Z cache_alignment : 64 2025-05-07T19:47:35.6264768Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6265078Z power management: 2025-05-07T19:47:35.6265228Z 2025-05-07T19:47:35.6265308Z processor : 2 2025-05-07T19:47:35.6265514Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6265756Z cpu family : 23 2025-05-07T19:47:35.6265954Z model : 49 2025-05-07T19:47:35.6266163Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6266393Z stepping : 0 2025-05-07T19:47:35.6266599Z microcode : 0x830107f 2025-05-07T19:47:35.6266821Z cpu MHz : 3301.175 2025-05-07T19:47:35.6267094Z cache size : 512 KB 2025-05-07T19:47:35.6267355Z physical id : 0 2025-05-07T19:47:35.6267556Z siblings : 16 2025-05-07T19:47:35.6267754Z core id : 2 2025-05-07T19:47:35.6267996Z cpu cores : 8 2025-05-07T19:47:35.6268228Z apicid : 4 2025-05-07T19:47:35.6268416Z initial apicid : 4 2025-05-07T19:47:35.6268622Z fpu : yes 2025-05-07T19:47:35.6268821Z fpu_exception : yes 2025-05-07T19:47:35.6269037Z cpuid level : 13 2025-05-07T19:47:35.6269230Z wp : yes 2025-05-07T19:47:35.6271354Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6273728Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6274221Z bogomips : 5599.99 2025-05-07T19:47:35.6274437Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6274757Z clflush size : 64 2025-05-07T19:47:35.6274962Z cache_alignment : 64 2025-05-07T19:47:35.6275232Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6275539Z power management: 2025-05-07T19:47:35.6275678Z 2025-05-07T19:47:35.6275756Z processor : 3 2025-05-07T19:47:35.6275959Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6276191Z cpu family : 23 2025-05-07T19:47:35.6276387Z model : 49 2025-05-07T19:47:35.6276593Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6276829Z stepping : 0 2025-05-07T19:47:35.6277033Z microcode : 0x830107f 2025-05-07T19:47:35.6277257Z cpu MHz : 3299.565 2025-05-07T19:47:35.6277460Z cache size : 512 KB 2025-05-07T19:47:35.6277674Z physical id : 0 2025-05-07T19:47:35.6277870Z siblings : 16 2025-05-07T19:47:35.6278065Z core id : 3 2025-05-07T19:47:35.6278249Z cpu cores : 8 2025-05-07T19:47:35.6278444Z apicid : 6 2025-05-07T19:47:35.6278631Z initial apicid : 6 2025-05-07T19:47:35.6278960Z fpu : yes 2025-05-07T19:47:35.6279148Z fpu_exception : yes 2025-05-07T19:47:35.6279366Z cpuid level : 13 2025-05-07T19:47:35.6279561Z wp : yes 2025-05-07T19:47:35.6281627Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6283876Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6284354Z bogomips : 5599.99 2025-05-07T19:47:35.6284570Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6285065Z clflush size : 64 2025-05-07T19:47:35.6285340Z cache_alignment : 64 2025-05-07T19:47:35.6285678Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6286064Z power management: 2025-05-07T19:47:35.6286230Z 2025-05-07T19:47:35.6286341Z processor : 4 2025-05-07T19:47:35.6286598Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6286894Z cpu family : 23 2025-05-07T19:47:35.6287118Z model : 49 2025-05-07T19:47:35.6287318Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6287555Z stepping : 0 2025-05-07T19:47:35.6287760Z microcode : 0x830107f 2025-05-07T19:47:35.6287979Z cpu MHz : 3294.634 2025-05-07T19:47:35.6288180Z cache size : 512 KB 2025-05-07T19:47:35.6288388Z physical id : 0 2025-05-07T19:47:35.6288581Z siblings : 16 2025-05-07T19:47:35.6288772Z core id : 4 2025-05-07T19:47:35.6288953Z cpu cores : 8 2025-05-07T19:47:35.6289148Z apicid : 8 2025-05-07T19:47:35.6289335Z initial apicid : 8 2025-05-07T19:47:35.6289542Z fpu : yes 2025-05-07T19:47:35.6289728Z fpu_exception : yes 2025-05-07T19:47:35.6289948Z cpuid level : 13 2025-05-07T19:47:35.6290141Z wp : yes 2025-05-07T19:47:35.6292105Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6294335Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6294813Z bogomips : 5599.99 2025-05-07T19:47:35.6295029Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6295255Z clflush size : 64 2025-05-07T19:47:35.6295465Z cache_alignment : 64 2025-05-07T19:47:35.6295786Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6296126Z power management: 2025-05-07T19:47:35.6296254Z 2025-05-07T19:47:35.6296342Z processor : 5 2025-05-07T19:47:35.6296562Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6296847Z cpu family : 23 2025-05-07T19:47:35.6297043Z model : 49 2025-05-07T19:47:35.6297244Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6297473Z stepping : 0 2025-05-07T19:47:35.6297690Z microcode : 0x830107f 2025-05-07T19:47:35.6297938Z cpu MHz : 3297.035 2025-05-07T19:47:35.6298140Z cache size : 512 KB 2025-05-07T19:47:35.6298343Z physical id : 0 2025-05-07T19:47:35.6298535Z siblings : 16 2025-05-07T19:47:35.6298727Z core id : 5 2025-05-07T19:47:35.6298911Z cpu cores : 8 2025-05-07T19:47:35.6299174Z apicid : 10 2025-05-07T19:47:35.6299362Z initial apicid : 10 2025-05-07T19:47:35.6299566Z fpu : yes 2025-05-07T19:47:35.6299748Z fpu_exception : yes 2025-05-07T19:47:35.6300026Z cpuid level : 13 2025-05-07T19:47:35.6300464Z wp : yes 2025-05-07T19:47:35.6302610Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6304852Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6305329Z bogomips : 5599.99 2025-05-07T19:47:35.6305543Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6305768Z clflush size : 64 2025-05-07T19:47:35.6305969Z cache_alignment : 64 2025-05-07T19:47:35.6306236Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6306542Z power management: 2025-05-07T19:47:35.6306675Z 2025-05-07T19:47:35.6306758Z processor : 6 2025-05-07T19:47:35.6306959Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6307188Z cpu family : 23 2025-05-07T19:47:35.6307382Z model : 49 2025-05-07T19:47:35.6307582Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6307810Z stepping : 0 2025-05-07T19:47:35.6308011Z microcode : 0x830107f 2025-05-07T19:47:35.6308219Z cpu MHz : 3302.784 2025-05-07T19:47:35.6308426Z cache size : 512 KB 2025-05-07T19:47:35.6308653Z physical id : 0 2025-05-07T19:47:35.6308848Z siblings : 16 2025-05-07T19:47:35.6309043Z core id : 6 2025-05-07T19:47:35.6309229Z cpu cores : 8 2025-05-07T19:47:35.6309423Z apicid : 12 2025-05-07T19:47:35.6309615Z initial apicid : 12 2025-05-07T19:47:35.6309827Z fpu : yes 2025-05-07T19:47:35.6310014Z fpu_exception : yes 2025-05-07T19:47:35.6310232Z cpuid level : 13 2025-05-07T19:47:35.6310424Z wp : yes 2025-05-07T19:47:35.6312495Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6314736Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6315211Z bogomips : 5599.99 2025-05-07T19:47:35.6315424Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6315653Z clflush size : 64 2025-05-07T19:47:35.6315857Z cache_alignment : 64 2025-05-07T19:47:35.6316124Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6316424Z power management: 2025-05-07T19:47:35.6316557Z 2025-05-07T19:47:35.6316641Z processor : 7 2025-05-07T19:47:35.6316839Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6317071Z cpu family : 23 2025-05-07T19:47:35.6317265Z model : 49 2025-05-07T19:47:35.6317464Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6317689Z stepping : 0 2025-05-07T19:47:35.6317887Z microcode : 0x830107f 2025-05-07T19:47:35.6318093Z cpu MHz : 3298.515 2025-05-07T19:47:35.6318300Z cache size : 512 KB 2025-05-07T19:47:35.6318507Z physical id : 0 2025-05-07T19:47:35.6318706Z siblings : 16 2025-05-07T19:47:35.6318898Z core id : 7 2025-05-07T19:47:35.6319079Z cpu cores : 8 2025-05-07T19:47:35.6319271Z apicid : 14 2025-05-07T19:47:35.6319459Z initial apicid : 14 2025-05-07T19:47:35.6319663Z fpu : yes 2025-05-07T19:47:35.6319846Z fpu_exception : yes 2025-05-07T19:47:35.6320058Z cpuid level : 13 2025-05-07T19:47:35.6320249Z wp : yes 2025-05-07T19:47:35.6322233Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6324856Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6325336Z bogomips : 5599.99 2025-05-07T19:47:35.6325549Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6325778Z clflush size : 64 2025-05-07T19:47:35.6325982Z cache_alignment : 64 2025-05-07T19:47:35.6326248Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6326619Z power management: 2025-05-07T19:47:35.6326748Z 2025-05-07T19:47:35.6326833Z processor : 8 2025-05-07T19:47:35.6327043Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6327335Z cpu family : 23 2025-05-07T19:47:35.6327539Z model : 49 2025-05-07T19:47:35.6327740Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6327970Z stepping : 0 2025-05-07T19:47:35.6328171Z microcode : 0x830107f 2025-05-07T19:47:35.6328380Z cpu MHz : 3298.646 2025-05-07T19:47:35.6328587Z cache size : 512 KB 2025-05-07T19:47:35.6328798Z physical id : 0 2025-05-07T19:47:35.6328993Z siblings : 16 2025-05-07T19:47:35.6329189Z core id : 0 2025-05-07T19:47:35.6329377Z cpu cores : 8 2025-05-07T19:47:35.6329574Z apicid : 1 2025-05-07T19:47:35.6329760Z initial apicid : 1 2025-05-07T19:47:35.6329968Z fpu : yes 2025-05-07T19:47:35.6330154Z fpu_exception : yes 2025-05-07T19:47:35.6330364Z cpuid level : 13 2025-05-07T19:47:35.6330556Z wp : yes 2025-05-07T19:47:35.6332525Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6334763Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6335234Z bogomips : 5599.99 2025-05-07T19:47:35.6335447Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6335666Z clflush size : 64 2025-05-07T19:47:35.6335878Z cache_alignment : 64 2025-05-07T19:47:35.6336143Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6336439Z power management: 2025-05-07T19:47:35.6336564Z 2025-05-07T19:47:35.6336647Z processor : 9 2025-05-07T19:47:35.6336844Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6337071Z cpu family : 23 2025-05-07T19:47:35.6337269Z model : 49 2025-05-07T19:47:35.6337470Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6337695Z stepping : 0 2025-05-07T19:47:35.6337895Z microcode : 0x830107f 2025-05-07T19:47:35.6338105Z cpu MHz : 3285.673 2025-05-07T19:47:35.6338313Z cache size : 512 KB 2025-05-07T19:47:35.6338521Z physical id : 0 2025-05-07T19:47:35.6338717Z siblings : 16 2025-05-07T19:47:35.6338912Z core id : 1 2025-05-07T19:47:35.6339094Z cpu cores : 8 2025-05-07T19:47:35.6339287Z apicid : 3 2025-05-07T19:47:35.6339474Z initial apicid : 3 2025-05-07T19:47:35.6339680Z fpu : yes 2025-05-07T19:47:35.6339865Z fpu_exception : yes 2025-05-07T19:47:35.6340077Z cpuid level : 13 2025-05-07T19:47:35.6340270Z wp : yes 2025-05-07T19:47:35.6342249Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6344755Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6345227Z bogomips : 5599.99 2025-05-07T19:47:35.6345532Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6345754Z clflush size : 64 2025-05-07T19:47:35.6345964Z cache_alignment : 64 2025-05-07T19:47:35.6346228Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6346528Z power management: 2025-05-07T19:47:35.6346654Z 2025-05-07T19:47:35.6346740Z processor : 10 2025-05-07T19:47:35.6346943Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6347176Z cpu family : 23 2025-05-07T19:47:35.6347366Z model : 49 2025-05-07T19:47:35.6347584Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6347854Z stepping : 0 2025-05-07T19:47:35.6348065Z microcode : 0x830107f 2025-05-07T19:47:35.6348276Z cpu MHz : 3297.848 2025-05-07T19:47:35.6348484Z cache size : 512 KB 2025-05-07T19:47:35.6348698Z physical id : 0 2025-05-07T19:47:35.6348892Z siblings : 16 2025-05-07T19:47:35.6349087Z core id : 2 2025-05-07T19:47:35.6349271Z cpu cores : 8 2025-05-07T19:47:35.6349465Z apicid : 5 2025-05-07T19:47:35.6349651Z initial apicid : 5 2025-05-07T19:47:35.6349855Z fpu : yes 2025-05-07T19:47:35.6350044Z fpu_exception : yes 2025-05-07T19:47:35.6350256Z cpuid level : 13 2025-05-07T19:47:35.6350448Z wp : yes 2025-05-07T19:47:35.6352678Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6355004Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6355536Z bogomips : 5599.99 2025-05-07T19:47:35.6355764Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6355984Z clflush size : 64 2025-05-07T19:47:35.6356198Z cache_alignment : 64 2025-05-07T19:47:35.6356467Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6356767Z power management: 2025-05-07T19:47:35.6356903Z 2025-05-07T19:47:35.6356981Z processor : 11 2025-05-07T19:47:35.6357186Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6357415Z cpu family : 23 2025-05-07T19:47:35.6357634Z model : 49 2025-05-07T19:47:35.6357859Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6358100Z stepping : 0 2025-05-07T19:47:35.6358299Z microcode : 0x830107f 2025-05-07T19:47:35.6358515Z cpu MHz : 3298.105 2025-05-07T19:47:35.6358714Z cache size : 512 KB 2025-05-07T19:47:35.6358922Z physical id : 0 2025-05-07T19:47:35.6359115Z siblings : 16 2025-05-07T19:47:35.6359310Z core id : 3 2025-05-07T19:47:35.6359492Z cpu cores : 8 2025-05-07T19:47:35.6359687Z apicid : 7 2025-05-07T19:47:35.6359869Z initial apicid : 7 2025-05-07T19:47:35.6360070Z fpu : yes 2025-05-07T19:47:35.6360253Z fpu_exception : yes 2025-05-07T19:47:35.6360467Z cpuid level : 13 2025-05-07T19:47:35.6360668Z wp : yes 2025-05-07T19:47:35.6362626Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6364991Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6365475Z bogomips : 5599.99 2025-05-07T19:47:35.6365681Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6365908Z clflush size : 64 2025-05-07T19:47:35.6366108Z cache_alignment : 64 2025-05-07T19:47:35.6366467Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6366771Z power management: 2025-05-07T19:47:35.6366908Z 2025-05-07T19:47:35.6366987Z processor : 12 2025-05-07T19:47:35.6367188Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6367419Z cpu family : 23 2025-05-07T19:47:35.6367610Z model : 49 2025-05-07T19:47:35.6367815Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6368050Z stepping : 0 2025-05-07T19:47:35.6368242Z microcode : 0x830107f 2025-05-07T19:47:35.6368469Z cpu MHz : 3287.959 2025-05-07T19:47:35.6368668Z cache size : 512 KB 2025-05-07T19:47:35.6368881Z physical id : 0 2025-05-07T19:47:35.6369083Z siblings : 16 2025-05-07T19:47:35.6369279Z core id : 4 2025-05-07T19:47:35.6369466Z cpu cores : 8 2025-05-07T19:47:35.6369660Z apicid : 9 2025-05-07T19:47:35.6369845Z initial apicid : 9 2025-05-07T19:47:35.6370051Z fpu : yes 2025-05-07T19:47:35.6370236Z fpu_exception : yes 2025-05-07T19:47:35.6370448Z cpuid level : 13 2025-05-07T19:47:35.6370651Z wp : yes 2025-05-07T19:47:35.6372644Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6374903Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6375394Z bogomips : 5599.99 2025-05-07T19:47:35.6375602Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6375828Z clflush size : 64 2025-05-07T19:47:35.6376031Z cache_alignment : 64 2025-05-07T19:47:35.6376301Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6376602Z power management: 2025-05-07T19:47:35.6376740Z 2025-05-07T19:47:35.6376816Z processor : 13 2025-05-07T19:47:35.6377022Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6377250Z cpu family : 23 2025-05-07T19:47:35.6377439Z model : 49 2025-05-07T19:47:35.6377636Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6377868Z stepping : 0 2025-05-07T19:47:35.6378057Z microcode : 0x830107f 2025-05-07T19:47:35.6378275Z cpu MHz : 3302.224 2025-05-07T19:47:35.6378474Z cache size : 512 KB 2025-05-07T19:47:35.6378688Z physical id : 0 2025-05-07T19:47:35.6378882Z siblings : 16 2025-05-07T19:47:35.6379075Z core id : 5 2025-05-07T19:47:35.6379259Z cpu cores : 8 2025-05-07T19:47:35.6379451Z apicid : 11 2025-05-07T19:47:35.6379640Z initial apicid : 11 2025-05-07T19:47:35.6379857Z fpu : yes 2025-05-07T19:47:35.6380042Z fpu_exception : yes 2025-05-07T19:47:35.6380253Z cpuid level : 13 2025-05-07T19:47:35.6380447Z wp : yes 2025-05-07T19:47:35.6382420Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6385023Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6385539Z bogomips : 5599.99 2025-05-07T19:47:35.6385759Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6386050Z clflush size : 64 2025-05-07T19:47:35.6386253Z cache_alignment : 64 2025-05-07T19:47:35.6386518Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6386890Z power management: 2025-05-07T19:47:35.6387029Z 2025-05-07T19:47:35.6387275Z processor : 14 2025-05-07T19:47:35.6387484Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6387740Z cpu family : 23 2025-05-07T19:47:35.6387953Z model : 49 2025-05-07T19:47:35.6388160Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6388394Z stepping : 0 2025-05-07T19:47:35.6388584Z microcode : 0x830107f 2025-05-07T19:47:35.6388803Z cpu MHz : 3299.983 2025-05-07T19:47:35.6389002Z cache size : 512 KB 2025-05-07T19:47:35.6389213Z physical id : 0 2025-05-07T19:47:35.6389406Z siblings : 16 2025-05-07T19:47:35.6389604Z core id : 6 2025-05-07T19:47:35.6389789Z cpu cores : 8 2025-05-07T19:47:35.6389976Z apicid : 13 2025-05-07T19:47:35.6390164Z initial apicid : 13 2025-05-07T19:47:35.6390361Z fpu : yes 2025-05-07T19:47:35.6390538Z fpu_exception : yes 2025-05-07T19:47:35.6390741Z cpuid level : 13 2025-05-07T19:47:35.6390924Z wp : yes 2025-05-07T19:47:35.6392992Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6395218Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6395700Z bogomips : 5599.99 2025-05-07T19:47:35.6395898Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6396120Z clflush size : 64 2025-05-07T19:47:35.6396316Z cache_alignment : 64 2025-05-07T19:47:35.6396573Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6396864Z power management: 2025-05-07T19:47:35.6396993Z 2025-05-07T19:47:35.6397067Z processor : 15 2025-05-07T19:47:35.6397263Z vendor_id : AuthenticAMD 2025-05-07T19:47:35.6397486Z cpu family : 23 2025-05-07T19:47:35.6397672Z model : 49 2025-05-07T19:47:35.6397881Z model name : AMD EPYC 7R32 2025-05-07T19:47:35.6398143Z stepping : 0 2025-05-07T19:47:35.6398329Z microcode : 0x830107f 2025-05-07T19:47:35.6398538Z cpu MHz : 3301.625 2025-05-07T19:47:35.6398729Z cache size : 512 KB 2025-05-07T19:47:35.6398931Z physical id : 0 2025-05-07T19:47:35.6399117Z siblings : 16 2025-05-07T19:47:35.6399301Z core id : 7 2025-05-07T19:47:35.6399483Z cpu cores : 8 2025-05-07T19:47:35.6399670Z apicid : 15 2025-05-07T19:47:35.6399854Z initial apicid : 15 2025-05-07T19:47:35.6400052Z fpu : yes 2025-05-07T19:47:35.6400232Z fpu_exception : yes 2025-05-07T19:47:35.6400429Z cpuid level : 13 2025-05-07T19:47:35.6400613Z wp : yes 2025-05-07T19:47:35.6402564Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:47:35.6404788Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T19:47:35.6405388Z bogomips : 5599.99 2025-05-07T19:47:35.6405588Z TLB size : 3072 4K pages 2025-05-07T19:47:35.6405809Z clflush size : 64 2025-05-07T19:47:35.6406005Z cache_alignment : 64 2025-05-07T19:47:35.6406261Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T19:47:35.6406554Z power management: 2025-05-07T19:47:35.6406684Z 2025-05-07T19:47:35.6406688Z 2025-05-07T19:47:35.6406797Z ################################################################################ 2025-05-07T19:47:35.6407221Z [INFO] Print Linux distribution info ... 2025-05-07T19:47:35.6407506Z + uname -a 2025-05-07T19:47:35.6407614Z 2025-05-07T19:47:35.6407978Z Linux d075a29d5e53 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T19:47:35.6408427Z 2025-05-07T19:47:35.6408501Z + uname -m 2025-05-07T19:47:35.6408614Z 2025-05-07T19:47:35.6408705Z x86_64 2025-05-07T19:47:35.6408805Z 2025-05-07T19:47:35.6408897Z + cat /proc/version 2025-05-07T19:47:35.6409033Z 2025-05-07T19:47:35.6409563Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T19:47:35.6410185Z 2025-05-07T19:47:35.6410267Z + cat /etc/os-release 2025-05-07T19:47:35.6410408Z 2025-05-07T19:47:35.6410484Z NAME="AlmaLinux" 2025-05-07T19:47:35.6410713Z VERSION="8.10 (Cerulean Leopard)" 2025-05-07T19:47:35.6410962Z ID="almalinux" 2025-05-07T19:47:35.6417664Z ID_LIKE="rhel centos fedora" 2025-05-07T19:47:35.6417955Z VERSION_ID="8.10" 2025-05-07T19:47:35.6418218Z PLATFORM_ID="platform:el8" 2025-05-07T19:47:35.6418502Z PRETTY_NAME="AlmaLinux 8.10 (Cerulean Leopard)" 2025-05-07T19:47:35.6418805Z ANSI_COLOR="0;34" 2025-05-07T19:47:35.6419026Z LOGO="fedora-logo-icon" 2025-05-07T19:47:35.6419312Z CPE_NAME="cpe:/o:almalinux:almalinux:8::baseos" 2025-05-07T19:47:35.6419653Z HOME_URL="https://almalinux.org/" 2025-05-07T19:47:35.6419982Z DOCUMENTATION_URL="https://wiki.almalinux.org/" 2025-05-07T19:47:35.6420337Z BUG_REPORT_URL="https://bugs.almalinux.org/" 2025-05-07T19:47:35.6420552Z 2025-05-07T19:47:35.6420660Z ALMALINUX_MANTISBT_PROJECT="AlmaLinux-8" 2025-05-07T19:47:35.6420971Z ALMALINUX_MANTISBT_PROJECT_VERSION="8.10" 2025-05-07T19:47:35.6421269Z REDHAT_SUPPORT_PRODUCT="AlmaLinux" 2025-05-07T19:47:35.6421555Z REDHAT_SUPPORT_PRODUCT_VERSION="8.10" 2025-05-07T19:47:35.6421831Z SUPPORT_END=2029-06-01 2025-05-07T19:47:35.6421973Z 2025-05-07T19:47:35.6422106Z [NOVA] Time taken to display System Info: 0 seconds 2025-05-07T19:47:35.6422447Z ################################################################################ 2025-05-07T19:47:35.6422743Z # Print Conda Environment Info 2025-05-07T19:47:35.6422984Z # 2025-05-07T19:47:35.6423205Z # [2025-05-07T19:47:35.641Z] + print_conda_info 2025-05-07T19:47:35.6423517Z ################################################################################ 2025-05-07T19:47:35.6423727Z 2025-05-07T19:47:35.6423810Z + conda info 2025-05-07T19:47:35.6423924Z 2025-05-07T19:47:36.3016383Z 2025-05-07T19:47:36.3017427Z active environment : /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:36.3018110Z active env location : /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:36.3018639Z shell level : 2 2025-05-07T19:47:36.3019042Z user config file : /github/home/.condarc 2025-05-07T19:47:36.3019534Z populated config files : /github/home/.condarc 2025-05-07T19:47:36.3020024Z conda version : 23.5.2 2025-05-07T19:47:36.3020396Z conda-build version : 24.3.0 2025-05-07T19:47:36.3020783Z python version : 3.11.4.final.0 2025-05-07T19:47:36.3021205Z virtual packages : __archspec=1=x86_64 2025-05-07T19:47:36.3021625Z __cuda=12.8=0 2025-05-07T19:47:36.3022003Z __glibc=2.28=0 2025-05-07T19:47:36.3022400Z __linux=6.1.130=0 2025-05-07T19:47:36.3022781Z __unix=0=0 2025-05-07T19:47:36.3023607Z base environment : /opt/conda (writable) 2025-05-07T19:47:36.3024089Z conda av data dir : /opt/conda/etc/conda 2025-05-07T19:47:36.3024534Z conda av metadata url : None 2025-05-07T19:47:36.3025023Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64 2025-05-07T19:47:36.3025632Z https://repo.anaconda.com/pkgs/main/noarch 2025-05-07T19:47:36.3026176Z https://repo.anaconda.com/pkgs/r/linux-64 2025-05-07T19:47:36.3026938Z https://repo.anaconda.com/pkgs/r/noarch 2025-05-07T19:47:36.3027422Z package cache : /opt/conda/pkgs 2025-05-07T19:47:36.3027861Z /github/home/.conda/pkgs 2025-05-07T19:47:36.3028300Z envs directories : /opt/conda/envs 2025-05-07T19:47:36.3028735Z /github/home/.conda/envs 2025-05-07T19:47:36.3029165Z platform : linux-64 2025-05-07T19:47:36.3030037Z user-agent : conda/23.5.2 requests/2.29.0 CPython/3.11.4 Linux/6.1.130-139.222.amzn2023.x86_64 almalinux/8.10 glibc/2.28 aau/0.7.0 c/. s/. e/. 2025-05-07T19:47:36.3030936Z UID:GID : 0:0 2025-05-07T19:47:36.3031308Z netrc file : None 2025-05-07T19:47:36.3031659Z offline mode : False 2025-05-07T19:47:36.3031898Z 2025-05-07T19:47:36.3031904Z 2025-05-07T19:47:36.3477889Z 2025-05-07T19:47:36.3478143Z 2025-05-07T19:47:36.3478515Z + conda info --envs 2025-05-07T19:47:36.3478692Z 2025-05-07T19:47:36.9965211Z # conda environments: 2025-05-07T19:47:36.9965489Z # 2025-05-07T19:47:36.9965685Z base /opt/conda 2025-05-07T19:47:36.9965864Z 2025-05-07T19:47:37.0427241Z 2025-05-07T19:47:37.0427286Z 2025-05-07T19:47:37.0427436Z PYTHON_VERSION: 3.9 2025-05-07T19:47:37.0453297Z python3 --version: Python 3.9.21 2025-05-07T19:47:37.0485374Z [NOVA] Time taken to display Conda information: 2 seconds 2025-05-07T19:47:37.0485773Z ################################################################################ 2025-05-07T19:47:37.0486160Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T19:47:37.0496757Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.github/scripts/utils_system.bash: line 144: lspci: command not found 2025-05-07T19:47:37.0545434Z /usr/bin/nvidia-smi 2025-05-07T19:47:37.0551650Z + nvidia-smi 2025-05-07T19:47:37.0552300Z 2025-05-07T19:47:37.0727050Z Wed May 7 19:47:37 2025 2025-05-07T19:47:37.0727425Z +-----------------------------------------------------------------------------------------+ 2025-05-07T19:47:37.0727920Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T19:47:37.0728392Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T19:47:37.0728866Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T19:47:37.0729390Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T19:47:37.0729808Z | | | MIG M. | 2025-05-07T19:47:37.0730127Z |=========================================+========================+======================| 2025-05-07T19:47:37.0898882Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T19:47:37.0899313Z | 0% 28C P8 18W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T19:47:37.0899688Z | | | N/A | 2025-05-07T19:47:37.0900058Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T19:47:37.0903164Z 2025-05-07T19:47:37.0903606Z +-----------------------------------------------------------------------------------------+ 2025-05-07T19:47:37.0904012Z | Processes: | 2025-05-07T19:47:37.0904747Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T19:47:37.0905140Z | ID ID Usage | 2025-05-07T19:47:37.0905467Z |=========================================================================================| 2025-05-07T19:47:37.0908602Z | No running processes found | 2025-05-07T19:47:37.0909077Z +-----------------------------------------------------------------------------------------+ 2025-05-07T19:47:37.3414335Z 2025-05-07T19:47:37.3419982Z ################################################################################ 2025-05-07T19:47:37.3420358Z [INFO] Printing AMD GPU info ... 2025-05-07T19:47:37.3431839Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.github/scripts/utils_system.bash: line 164: lspci: command not found 2025-05-07T19:47:37.3485409Z /usr/bin/which: no rocminfo in (/__w/_temp/conda_environment_14891846312/bin:/opt/conda/condabin:/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/sbin) 2025-05-07T19:47:37.3486841Z [CHECK] rocminfo not found 2025-05-07T19:47:37.3522770Z /usr/bin/which: no rocm-smi in (/__w/_temp/conda_environment_14891846312/bin:/opt/conda/condabin:/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/sbin) 2025-05-07T19:47:37.3524203Z [CHECK] rocm-smi not found 2025-05-07T19:47:37.3555387Z [NOVA] Time taken to display GPU Info: 0 seconds 2025-05-07T19:47:37.3556163Z ################################################################################ 2025-05-07T19:47:37.3556736Z # Install Build Tools 2025-05-07T19:47:37.3557177Z # 2025-05-07T19:47:37.3588274Z # [2025-05-07T19:47:37.358Z] + install_build_tools /__w/_temp/conda_environment_14891846312 2025-05-07T19:47:37.3588778Z ################################################################################ 2025-05-07T19:47:37.3588996Z 2025-05-07T19:47:37.3619969Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T19:47:37.4920611Z [CHECK] Network does not appear to be blocked. 2025-05-07T19:47:37.4929574Z [INSTALL] Installing build tools ... 2025-05-07T19:47:37.4966020Z [EXEC] [ATTEMPT 0/3] + conda install -p /__w/_temp/conda_environment_14891846312 -c conda-forge --override-channels -y auditwheel bazel cmake>=3.30 hypothesis jinja2 make ncurses ninja openblas patchelf rhash scikit-build wheel pyyaml 2025-05-07T19:48:00.2933988Z Collecting package metadata (current_repodata.json): ...working... done 2025-05-07T19:48:03.0464365Z Solving environment: ...working... done 2025-05-07T19:48:03.3023464Z 2025-05-07T19:48:03.3023639Z ## Package Plan ## 2025-05-07T19:48:03.3023956Z 2025-05-07T19:48:03.3024149Z environment location: /__w/_temp/conda_environment_14891846312 2025-05-07T19:48:03.3024437Z 2025-05-07T19:48:03.3024526Z added / updated specs: 2025-05-07T19:48:03.3024767Z - auditwheel 2025-05-07T19:48:03.3024961Z - bazel 2025-05-07T19:48:03.3025166Z - cmake[version='>=3.30'] 2025-05-07T19:48:03.3025400Z - hypothesis 2025-05-07T19:48:03.3025724Z - jinja2 2025-05-07T19:48:03.3026204Z - make 2025-05-07T19:48:03.3026638Z - ncurses 2025-05-07T19:48:03.3027005Z - ninja 2025-05-07T19:48:03.3027268Z - openblas 2025-05-07T19:48:03.3027471Z - patchelf 2025-05-07T19:48:03.3027660Z - pyyaml 2025-05-07T19:48:03.3027847Z - rhash 2025-05-07T19:48:03.3028030Z - scikit-build 2025-05-07T19:48:03.3028274Z - wheel 2025-05-07T19:48:03.3028389Z 2025-05-07T19:48:03.3028394Z 2025-05-07T19:48:03.3028511Z The following packages will be downloaded: 2025-05-07T19:48:03.3028985Z 2025-05-07T19:48:03.3029098Z package | build 2025-05-07T19:48:03.3029411Z ---------------------------|----------------- 2025-05-07T19:48:03.3029771Z alsa-lib-1.2.14 | hb9d3cd8_0 553 KB conda-forge 2025-05-07T19:48:03.3030177Z attrs-25.3.0 | pyh71513ae_0 56 KB conda-forge 2025-05-07T19:48:03.3030590Z auditwheel-6.2.0 | pyha804496_1 40 KB conda-forge 2025-05-07T19:48:03.3031146Z bazel-7.5.0 | h96810dc_2 47.4 MB conda-forge 2025-05-07T19:48:03.3031533Z c-ares-1.34.5 | hb9d3cd8_0 202 KB conda-forge 2025-05-07T19:48:03.3032091Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge 2025-05-07T19:48:03.3032522Z cairo-1.18.4 | h3394656_0 955 KB conda-forge 2025-05-07T19:48:03.3032903Z click-8.1.8 | pyh707e725_0 83 KB conda-forge 2025-05-07T19:48:03.3033282Z cmake-4.0.2 | h74e3db0_0 19.4 MB conda-forge 2025-05-07T19:48:03.3033668Z distro-1.9.0 | pyhd8ed1ab_1 41 KB conda-forge 2025-05-07T19:48:03.3034100Z exceptiongroup-1.2.2 | pyhd8ed1ab_1 20 KB conda-forge 2025-05-07T19:48:03.3034612Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T19:48:03.3035119Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T19:48:03.3035622Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T19:48:03.3036101Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T19:48:03.3036526Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T19:48:03.3036973Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T19:48:03.3037420Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T19:48:03.3037851Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T19:48:03.3038241Z giflib-5.2.2 | hd590300_0 75 KB conda-forge 2025-05-07T19:48:03.3038632Z graphite2-1.3.13 | h59595ed_1003 95 KB conda-forge 2025-05-07T19:48:03.3039042Z harfbuzz-11.1.0 | h3beb420_0 1.6 MB conda-forge 2025-05-07T19:48:03.3039462Z hypothesis-6.131.14 | pyha770c72_0 348 KB conda-forge 2025-05-07T19:48:03.3039858Z icu-75.1 | he02047a_0 11.6 MB conda-forge 2025-05-07T19:48:03.3040213Z ijar-7.5.0 | h5888daf_0 114 KB conda-forge 2025-05-07T19:48:03.3040591Z jinja2-3.1.6 | pyhd8ed1ab_0 110 KB conda-forge 2025-05-07T19:48:03.3040994Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T19:48:03.3041377Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T19:48:03.3041749Z lcms2-2.17 | h717163a_0 242 KB conda-forge 2025-05-07T19:48:03.3042110Z lerc-4.0.0 | h0aef613_1 258 KB conda-forge 2025-05-07T19:48:03.3042527Z libabseil-20250127.1 | cxx17_hbbce691_0 1.3 MB conda-forge 2025-05-07T19:48:03.3042956Z libcups-2.3.3 | h4637d8d_4 4.3 MB conda-forge 2025-05-07T19:48:03.3043349Z libcurl-8.13.0 | h332b0f4_0 428 KB conda-forge 2025-05-07T19:48:03.3043753Z libdeflate-1.23 | h86f0d12_0 71 KB conda-forge 2025-05-07T19:48:03.3044153Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T19:48:03.3044546Z libffi-3.4.6 | h2dba641_1 56 KB conda-forge 2025-05-07T19:48:03.3044948Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T19:48:03.3045473Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T19:48:03.3045886Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge 2025-05-07T19:48:03.3046275Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T19:48:03.3046692Z libgfortran-15.1.0 | h69a702a_2 34 KB conda-forge 2025-05-07T19:48:03.3047195Z libgfortran5-15.1.0 | hcea5267_2 1.5 MB conda-forge 2025-05-07T19:48:03.3047614Z libglib-2.84.1 | h3618099_1 3.8 MB conda-forge 2025-05-07T19:48:03.3048008Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge 2025-05-07T19:48:03.3048397Z libgrpc-1.71.0 | h8e591d7_1 7.6 MB conda-forge 2025-05-07T19:48:03.3048795Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T19:48:03.3049215Z libjpeg-turbo-3.1.0 | hb9d3cd8_0 614 KB conda-forge 2025-05-07T19:48:03.3049634Z liblzma-5.8.1 | hb9d3cd8_0 110 KB conda-forge 2025-05-07T19:48:03.3050037Z libnghttp2-1.64.0 | h161d5f1_0 632 KB conda-forge 2025-05-07T19:48:03.3050440Z libnsl-2.0.1 | hd590300_0 33 KB conda-forge 2025-05-07T19:48:03.3050870Z libopenblas-0.3.29 |pthreads_h94d23a6_0 5.6 MB conda-forge 2025-05-07T19:48:03.3051296Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T19:48:03.3051705Z libprotobuf-5.29.3 | h501fc15_1 3.2 MB conda-forge 2025-05-07T19:48:03.3052121Z libre2-11-2024.07.02 | hba17884_3 205 KB conda-forge 2025-05-07T19:48:03.3052545Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T19:48:03.3052941Z libssh2-1.11.1 | hcf80075_0 298 KB conda-forge 2025-05-07T19:48:03.3053348Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge 2025-05-07T19:48:03.3053768Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge 2025-05-07T19:48:03.3054171Z libtiff-4.7.0 | hd9ff511_4 419 KB conda-forge 2025-05-07T19:48:03.3054569Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T19:48:03.3054957Z libuv-1.50.0 | hb9d3cd8_0 870 KB conda-forge 2025-05-07T19:48:03.3055357Z libwebp-base-1.5.0 | h851e524_0 420 KB conda-forge 2025-05-07T19:48:03.3055759Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T19:48:03.3056148Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge 2025-05-07T19:48:03.3056551Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T19:48:03.3056925Z make-4.4.1 | hb9d3cd8_2 501 KB conda-forge 2025-05-07T19:48:03.3057330Z markupsafe-3.0.2 | py39h9399b63_1 22 KB conda-forge 2025-05-07T19:48:03.3057729Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T19:48:03.3058114Z ninja-1.12.1 | hff21bea_1 158 KB conda-forge 2025-05-07T19:48:03.3058532Z openblas-0.3.29 |pthreads_h6ec200e_0 5.8 MB conda-forge 2025-05-07T19:48:03.3058954Z openjdk-23.0.2 | h53dfc1b_2 181.4 MB conda-forge 2025-05-07T19:48:03.3059345Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge 2025-05-07T19:48:03.3059744Z packaging-25.0 | pyh29332c3_1 61 KB conda-forge 2025-05-07T19:48:03.3060155Z patchelf-0.18.0 | h3f2d84a_2 133 KB conda-forge 2025-05-07T19:48:03.3060539Z pcre2-10.45 | hc749103_0 1.1 MB conda-forge 2025-05-07T19:48:03.3061014Z pixman-0.46.0 | h29eaf8c_0 389 KB conda-forge 2025-05-07T19:48:03.3061426Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T19:48:03.3061854Z pyelftools-0.32 | pyh707e725_1 146 KB conda-forge 2025-05-07T19:48:03.3062278Z python-3.9.22 |h85ef794_1_cpython 22.5 MB conda-forge 2025-05-07T19:48:03.3062751Z python_abi-3.9 | 7_cp39 7 KB conda-forge 2025-05-07T19:48:03.3063146Z pyyaml-6.0.2 | py39h9399b63_2 178 KB conda-forge 2025-05-07T19:48:03.3063537Z re2-2024.07.02 | h9925aae_3 26 KB conda-forge 2025-05-07T19:48:03.3063912Z rhash-1.4.5 | hb9d3cd8_0 183 KB conda-forge 2025-05-07T19:48:03.3064326Z scikit-build-0.18.1 | pyhae55e72_2 114 KB conda-forge 2025-05-07T19:48:03.3064749Z singlejar-7.5.0 | h0e684df_1 122 KB conda-forge 2025-05-07T19:48:03.3065196Z sortedcontainers-2.4.0 | pyhd8ed1ab_1 28 KB conda-forge 2025-05-07T19:48:03.3065631Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T19:48:03.3066009Z tomli-2.2.1 | pyhd8ed1ab_1 19 KB conda-forge 2025-05-07T19:48:03.3066451Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge 2025-05-07T19:48:03.3066928Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge 2025-05-07T19:48:03.3067352Z wheel-0.45.1 | pyhd8ed1ab_1 61 KB conda-forge 2025-05-07T19:48:03.3067760Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T19:48:03.3068170Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T19:48:03.3068582Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T19:48:03.3069005Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T19:48:03.3069426Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T19:48:03.3069855Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T19:48:03.3070285Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T19:48:03.3070712Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T19:48:03.3071132Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T19:48:03.3071583Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T19:48:03.3072085Z xorg-libxt-1.3.1 | hb9d3cd8_0 371 KB conda-forge 2025-05-07T19:48:03.3072496Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T19:48:03.3072890Z yaml-0.2.5 | h7f98852_2 87 KB conda-forge 2025-05-07T19:48:03.3073252Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T19:48:03.3073610Z ------------------------------------------------------------ 2025-05-07T19:48:03.3073932Z Total: 349.2 MB 2025-05-07T19:48:03.3074145Z 2025-05-07T19:48:03.3074268Z The following NEW packages will be INSTALLED: 2025-05-07T19:48:03.3074484Z 2025-05-07T19:48:03.3074687Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T19:48:03.3075105Z attrs conda-forge/noarch::attrs-25.3.0-pyh71513ae_0 2025-05-07T19:48:03.3075550Z auditwheel conda-forge/noarch::auditwheel-6.2.0-pyha804496_1 2025-05-07T19:48:03.3075978Z bazel conda-forge/linux-64::bazel-7.5.0-h96810dc_2 2025-05-07T19:48:03.3076371Z cairo conda-forge/linux-64::cairo-1.18.4-h3394656_0 2025-05-07T19:48:03.3076771Z click conda-forge/noarch::click-8.1.8-pyh707e725_0 2025-05-07T19:48:03.3077655Z distro conda-forge/noarch::distro-1.9.0-pyhd8ed1ab_1 2025-05-07T19:48:03.3078137Z exceptiongroup conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1 2025-05-07T19:48:03.3078718Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T19:48:03.3079464Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T19:48:03.3080178Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T19:48:03.3080739Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T19:48:03.3081233Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T19:48:03.3081712Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T19:48:03.3082197Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T19:48:03.3082654Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T19:48:03.3083079Z giflib conda-forge/linux-64::giflib-5.2.2-hd590300_0 2025-05-07T19:48:03.3083520Z graphite2 conda-forge/linux-64::graphite2-1.3.13-h59595ed_1003 2025-05-07T19:48:03.3083968Z harfbuzz conda-forge/linux-64::harfbuzz-11.1.0-h3beb420_0 2025-05-07T19:48:03.3084433Z hypothesis conda-forge/noarch::hypothesis-6.131.14-pyha770c72_0 2025-05-07T19:48:03.3085164Z icu conda-forge/linux-64::icu-75.1-he02047a_0 2025-05-07T19:48:03.3085563Z ijar conda-forge/linux-64::ijar-7.5.0-h5888daf_0 2025-05-07T19:48:03.3085969Z jinja2 conda-forge/noarch::jinja2-3.1.6-pyhd8ed1ab_0 2025-05-07T19:48:03.3086392Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T19:48:03.3086806Z lcms2 conda-forge/linux-64::lcms2-2.17-h717163a_0 2025-05-07T19:48:03.3087195Z lerc conda-forge/linux-64::lerc-4.0.0-h0aef613_1 2025-05-07T19:48:03.3087651Z libabseil conda-forge/linux-64::libabseil-20250127.1-cxx17_hbbce691_0 2025-05-07T19:48:03.3088128Z libcups conda-forge/linux-64::libcups-2.3.3-h4637d8d_4 2025-05-07T19:48:03.3088565Z libdeflate conda-forge/linux-64::libdeflate-1.23-h86f0d12_0 2025-05-07T19:48:03.3089016Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T19:48:03.3089475Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T19:48:03.3089978Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T19:48:03.3090438Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2 2025-05-07T19:48:03.3090880Z libgfortran conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2 2025-05-07T19:48:03.3091370Z libgfortran5 conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2 2025-05-07T19:48:03.3091826Z libglib conda-forge/linux-64::libglib-2.84.1-h3618099_1 2025-05-07T19:48:03.3092253Z libgrpc conda-forge/linux-64::libgrpc-1.71.0-h8e591d7_1 2025-05-07T19:48:03.3092684Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T19:48:03.3093156Z libjpeg-turbo conda-forge/linux-64::libjpeg-turbo-3.1.0-hb9d3cd8_0 2025-05-07T19:48:03.3093621Z liblzma conda-forge/linux-64::liblzma-5.8.1-hb9d3cd8_0 2025-05-07T19:48:03.3094029Z libnsl conda-forge/linux-64::libnsl-2.0.1-hd590300_0 2025-05-07T19:48:03.3094514Z libopenblas conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0 2025-05-07T19:48:03.3094998Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T19:48:03.3095440Z libprotobuf conda-forge/linux-64::libprotobuf-5.29.3-h501fc15_1 2025-05-07T19:48:03.3095911Z libre2-11 conda-forge/linux-64::libre2-11-2024.07.02-hba17884_3 2025-05-07T19:48:03.3096360Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T19:48:03.3096966Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2 2025-05-07T19:48:03.3097458Z libtiff conda-forge/linux-64::libtiff-4.7.0-hd9ff511_4 2025-05-07T19:48:03.3097874Z libuuid conda-forge/linux-64::libuuid-2.38.1-h0b41bf4_0 2025-05-07T19:48:03.3098330Z libwebp-base conda-forge/linux-64::libwebp-base-1.5.0-h851e524_0 2025-05-07T19:48:03.3098774Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T19:48:03.3099318Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1 2025-05-07T19:48:03.3099760Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T19:48:03.3100157Z make conda-forge/linux-64::make-4.4.1-hb9d3cd8_2 2025-05-07T19:48:03.3100597Z markupsafe conda-forge/linux-64::markupsafe-3.0.2-py39h9399b63_1 2025-05-07T19:48:03.3101096Z openblas conda-forge/linux-64::openblas-0.3.29-pthreads_h6ec200e_0 2025-05-07T19:48:03.3101564Z openjdk conda-forge/linux-64::openjdk-23.0.2-h53dfc1b_2 2025-05-07T19:48:03.3102017Z packaging conda-forge/noarch::packaging-25.0-pyh29332c3_1 2025-05-07T19:48:03.3102464Z patchelf conda-forge/linux-64::patchelf-0.18.0-h3f2d84a_2 2025-05-07T19:48:03.3102884Z pcre2 conda-forge/linux-64::pcre2-10.45-hc749103_0 2025-05-07T19:48:03.3103288Z pixman conda-forge/linux-64::pixman-0.46.0-h29eaf8c_0 2025-05-07T19:48:03.3103766Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T19:48:03.3104258Z pyelftools conda-forge/noarch::pyelftools-0.32-pyh707e725_1 2025-05-07T19:48:03.3104696Z python_abi conda-forge/noarch::python_abi-3.9-7_cp39 2025-05-07T19:48:03.3105118Z pyyaml conda-forge/linux-64::pyyaml-6.0.2-py39h9399b63_2 2025-05-07T19:48:03.3105519Z re2 conda-forge/linux-64::re2-2024.07.02-h9925aae_3 2025-05-07T19:48:03.3105965Z scikit-build conda-forge/noarch::scikit-build-0.18.1-pyhae55e72_2 2025-05-07T19:48:03.3106445Z singlejar conda-forge/linux-64::singlejar-7.5.0-h0e684df_1 2025-05-07T19:48:03.3106960Z sortedcontainers conda-forge/noarch::sortedcontainers-2.4.0-pyhd8ed1ab_1 2025-05-07T19:48:03.3107458Z tomli conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1 2025-05-07T19:48:03.3107943Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0 2025-05-07T19:48:03.3108521Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0 2025-05-07T19:48:03.3109041Z xorg-libice conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0 2025-05-07T19:48:03.3109510Z xorg-libsm conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0 2025-05-07T19:48:03.3109975Z xorg-libx11 conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0 2025-05-07T19:48:03.3110444Z xorg-libxau conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0 2025-05-07T19:48:03.3110935Z xorg-libxdmcp conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0 2025-05-07T19:48:03.3111433Z xorg-libxext conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0 2025-05-07T19:48:03.3112025Z xorg-libxfixes conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0 2025-05-07T19:48:03.3112523Z xorg-libxi conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0 2025-05-07T19:48:03.3113005Z xorg-libxrandr conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0 2025-05-07T19:48:03.3113543Z xorg-libxrender conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0 2025-05-07T19:48:03.3114037Z xorg-libxt conda-forge/linux-64::xorg-libxt-1.3.1-hb9d3cd8_0 2025-05-07T19:48:03.3114506Z xorg-libxtst conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3 2025-05-07T19:48:03.3114940Z yaml conda-forge/linux-64::yaml-0.2.5-h7f98852_2 2025-05-07T19:48:03.3115177Z 2025-05-07T19:48:03.3115283Z The following packages will be REMOVED: 2025-05-07T19:48:03.3115487Z 2025-05-07T19:48:03.3115575Z expat-2.7.1-h6a678d5_0 2025-05-07T19:48:03.3115901Z sqlite-3.45.3-h5eee18b_0 2025-05-07T19:48:03.3116223Z xz-5.6.4-h5eee18b_1 2025-05-07T19:48:03.3116464Z zlib-1.2.13-h5eee18b_1 2025-05-07T19:48:03.3116616Z 2025-05-07T19:48:03.3116724Z The following packages will be UPDATED: 2025-05-07T19:48:03.3116942Z 2025-05-07T19:48:03.3117264Z c-ares pkgs/main::c-ares-1.19.1-h5eee18b_0 --> conda-forge::c-ares-1.34.5-hb9d3cd8_0 2025-05-07T19:48:03.3118086Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0 2025-05-07T19:48:03.3118798Z cmake pkgs/main::cmake-3.31.2-h27e300b_0 --> conda-forge::cmake-4.0.2-h74e3db0_0 2025-05-07T19:48:03.3119345Z krb5 pkgs/main::krb5-1.20.1-h143b758_1 --> conda-forge::krb5-1.21.3-h659f571_0 2025-05-07T19:48:03.3119907Z libcurl pkgs/main::libcurl-8.12.1-hc9e6f67_0 --> conda-forge::libcurl-8.13.0-h332b0f4_0 2025-05-07T19:48:03.3120573Z libffi pkgs/main::libffi-3.4.4-h6a678d5_1 --> conda-forge::libffi-3.4.6-h2dba641_1 2025-05-07T19:48:03.3121220Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2 2025-05-07T19:48:03.3121942Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2 2025-05-07T19:48:03.3122601Z libnghttp2 pkgs/main::libnghttp2-1.57.0-h2d74bed~ --> conda-forge::libnghttp2-1.64.0-h161d5f1_0 2025-05-07T19:48:03.3123305Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2 2025-05-07T19:48:03.3123947Z libuv pkgs/main::libuv-1.48.0-h5eee18b_0 --> conda-forge::libuv-1.50.0-hb9d3cd8_0 2025-05-07T19:48:03.3124531Z ncurses pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3 2025-05-07T19:48:03.3125109Z ninja pkgs/main::ninja-1.12.1-h06a4308_0 --> conda-forge::ninja-1.12.1-hff21bea_1 2025-05-07T19:48:03.3125705Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1 2025-05-07T19:48:03.3126332Z python pkgs/main::python-3.9.21-he870216_1 --> conda-forge::python-3.9.22-h85ef794_1_cpython 2025-05-07T19:48:03.3126939Z rhash pkgs/main::rhash-1.4.3-hdbd6064_0 --> conda-forge::rhash-1.4.5-hb9d3cd8_0 2025-05-07T19:48:03.3127531Z wheel pkgs/main::wheel-0.37.1-pyhd3eb1b0_0 --> conda-forge::wheel-0.45.1-pyhd8ed1ab_1 2025-05-07T19:48:03.3128100Z zstd pkgs/main::zstd-1.5.6-hc292b87_0 --> conda-forge::zstd-1.5.7-hb8e6e7a_2 2025-05-07T19:48:03.3128433Z 2025-05-07T19:48:03.3128645Z The following packages will be SUPERSEDED by a higher-priority channel: 2025-05-07T19:48:03.3128959Z 2025-05-07T19:48:03.3129220Z libssh2 pkgs/main::libssh2-1.11.1-h251f7ec_0 --> conda-forge::libssh2-1.11.1-hcf80075_0 2025-05-07T19:48:03.3129788Z tk pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101 2025-05-07T19:48:03.3130175Z 2025-05-07T19:48:03.3130200Z 2025-05-07T19:48:03.3130206Z 2025-05-07T19:48:03.3130344Z Downloading and Extracting Packages 2025-05-07T19:48:03.3130538Z 2025-05-07T19:48:03.3130677Z xorg-libxt-1.3.1 | 371 KB | | 0% 2025-05-07T19:48:03.3130907Z 2025-05-07T19:48:03.3131713Z libopenblas-0.3.29 | 5.6 MB | | 0%  2025-05-07T19:48:03.3131960Z 2025-05-07T19:48:03.3141436Z 2025-05-07T19:48:03.3155756Z libprotobuf-5.29.3 | 3.2 MB | | 0%  2025-05-07T19:48:03.3156019Z 2025-05-07T19:48:03.3156023Z 2025-05-07T19:48:03.3156027Z 2025-05-07T19:48:03.3171572Z xorg-libxext-1.3.6 | 49 KB | | 0%  2025-05-07T19:48:03.3171866Z 2025-05-07T19:48:03.3171870Z 2025-05-07T19:48:03.3171874Z 2025-05-07T19:48:03.3171878Z 2025-05-07T19:48:03.3186991Z graphite2-1.3.13 | 95 KB | | 0%  2025-05-07T19:48:03.3187500Z 2025-05-07T19:48:03.3187506Z 2025-05-07T19:48:03.3187511Z 2025-05-07T19:48:03.3187516Z 2025-05-07T19:48:03.3187521Z 2025-05-07T19:48:03.3188937Z scikit-build-0.18.1 | 114 KB | | 0%  2025-05-07T19:48:03.3189220Z 2025-05-07T19:48:03.3189224Z 2025-05-07T19:48:03.3189228Z 2025-05-07T19:48:03.3189232Z 2025-05-07T19:48:03.3189235Z 2025-05-07T19:48:03.3189687Z 2025-05-07T19:48:03.3190052Z krb5-1.21.3 | 1.3 MB | | 0%  2025-05-07T19:48:03.3190536Z 2025-05-07T19:48:03.3190549Z 2025-05-07T19:48:03.3190553Z 2025-05-07T19:48:03.3190557Z 2025-05-07T19:48:03.3190560Z 2025-05-07T19:48:03.3190564Z 2025-05-07T19:48:03.3191698Z 2025-05-07T19:48:03.3193915Z xorg-libxau-1.0.12 | 14 KB | | 0%  2025-05-07T19:48:03.3194194Z 2025-05-07T19:48:03.3194199Z 2025-05-07T19:48:03.3194202Z 2025-05-07T19:48:03.3194206Z 2025-05-07T19:48:03.3194210Z 2025-05-07T19:48:03.3194213Z 2025-05-07T19:48:03.3194227Z 2025-05-07T19:48:03.3194231Z 2025-05-07T19:48:03.3196241Z libtiff-4.7.0 | 419 KB | | 0%  2025-05-07T19:48:03.3196651Z 2025-05-07T19:48:03.3196657Z 2025-05-07T19:48:03.3196662Z 2025-05-07T19:48:03.3196667Z 2025-05-07T19:48:03.3196672Z 2025-05-07T19:48:03.3196678Z 2025-05-07T19:48:03.3196683Z 2025-05-07T19:48:03.3196688Z 2025-05-07T19:48:03.3196694Z 2025-05-07T19:48:03.3215077Z freetype-2.13.3 | 168 KB | | 0%  2025-05-07T19:48:03.3225006Z 2025-05-07T19:48:03.3225014Z 2025-05-07T19:48:03.3225020Z 2025-05-07T19:48:03.3225026Z 2025-05-07T19:48:03.3225031Z 2025-05-07T19:48:03.3225037Z 2025-05-07T19:48:03.3225043Z 2025-05-07T19:48:03.3225053Z 2025-05-07T19:48:03.3225059Z 2025-05-07T19:48:03.3225064Z 2025-05-07T19:48:03.3225637Z bazel-7.5.0 | 47.4 MB | | 0%  2025-05-07T19:48:03.3226053Z 2025-05-07T19:48:03.3226060Z 2025-05-07T19:48:03.3226083Z 2025-05-07T19:48:03.3226088Z 2025-05-07T19:48:03.3226093Z 2025-05-07T19:48:03.3226099Z 2025-05-07T19:48:03.3226104Z 2025-05-07T19:48:03.3226138Z 2025-05-07T19:48:03.3226144Z 2025-05-07T19:48:03.3226149Z 2025-05-07T19:48:03.3226154Z 2025-05-07T19:48:03.3226537Z patchelf-0.18.0 | 133 KB | | 0%  2025-05-07T19:48:03.3226925Z 2025-05-07T19:48:03.3226930Z 2025-05-07T19:48:03.3226936Z 2025-05-07T19:48:03.3226941Z 2025-05-07T19:48:03.3226957Z 2025-05-07T19:48:03.3226972Z 2025-05-07T19:48:03.3226977Z 2025-05-07T19:48:03.3226982Z 2025-05-07T19:48:03.3226987Z 2025-05-07T19:48:03.3226993Z 2025-05-07T19:48:03.3226998Z 2025-05-07T19:48:03.3227003Z 2025-05-07T19:48:03.3227412Z cairo-1.18.4 | 955 KB | | 0%  2025-05-07T19:48:03.3227689Z 2025-05-07T19:48:03.3227693Z 2025-05-07T19:48:03.3227707Z 2025-05-07T19:48:03.3227710Z 2025-05-07T19:48:03.3227714Z 2025-05-07T19:48:03.3227718Z 2025-05-07T19:48:03.3227727Z 2025-05-07T19:48:03.3227730Z 2025-05-07T19:48:03.3227734Z 2025-05-07T19:48:03.3227738Z 2025-05-07T19:48:03.3227741Z 2025-05-07T19:48:03.3227745Z 2025-05-07T19:48:03.3227748Z 2025-05-07T19:48:03.3228004Z click-8.1.8 | 83 KB | | 0%  2025-05-07T19:48:03.3228346Z 2025-05-07T19:48:03.3228350Z 2025-05-07T19:48:03.3228353Z 2025-05-07T19:48:03.3228357Z 2025-05-07T19:48:03.3228361Z 2025-05-07T19:48:03.3228364Z 2025-05-07T19:48:03.3228374Z 2025-05-07T19:48:03.3228377Z 2025-05-07T19:48:03.3228381Z 2025-05-07T19:48:03.3228385Z 2025-05-07T19:48:03.3228389Z 2025-05-07T19:48:03.3228392Z 2025-05-07T19:48:03.3228396Z 2025-05-07T19:48:03.3228400Z 2025-05-07T19:48:03.3231524Z libfreetype-2.13.3 | 8 KB | | 0%  2025-05-07T19:48:03.3232094Z 2025-05-07T19:48:03.3232101Z 2025-05-07T19:48:03.3232108Z 2025-05-07T19:48:03.3232113Z 2025-05-07T19:48:03.3232118Z 2025-05-07T19:48:03.3232290Z 2025-05-07T19:48:03.3232294Z 2025-05-07T19:48:03.3232297Z 2025-05-07T19:48:03.3232301Z 2025-05-07T19:48:03.3232305Z 2025-05-07T19:48:03.3232308Z 2025-05-07T19:48:03.3232312Z 2025-05-07T19:48:03.3232316Z 2025-05-07T19:48:03.3232320Z 2025-05-07T19:48:03.3232329Z 2025-05-07T19:48:03.3233913Z distro-1.9.0 | 41 KB | | 0%  2025-05-07T19:48:03.3234322Z 2025-05-07T19:48:03.3234328Z 2025-05-07T19:48:03.3234333Z 2025-05-07T19:48:03.3234522Z 2025-05-07T19:48:03.3234530Z 2025-05-07T19:48:03.3234536Z 2025-05-07T19:48:03.3234541Z 2025-05-07T19:48:03.3234559Z 2025-05-07T19:48:03.3234565Z 2025-05-07T19:48:03.3234570Z 2025-05-07T19:48:03.3234575Z 2025-05-07T19:48:03.3234580Z 2025-05-07T19:48:03.3234585Z 2025-05-07T19:48:03.3234590Z 2025-05-07T19:48:03.3234595Z 2025-05-07T19:48:03.3234600Z 2025-05-07T19:48:03.3235941Z libgcc-15.1.0 | 810 KB | | 0%  2025-05-07T19:48:03.3236460Z 2025-05-07T19:48:03.3236469Z 2025-05-07T19:48:03.3236477Z 2025-05-07T19:48:03.3236485Z 2025-05-07T19:48:03.3236495Z 2025-05-07T19:48:03.3236539Z 2025-05-07T19:48:03.3236548Z 2025-05-07T19:48:03.3236558Z 2025-05-07T19:48:03.3236567Z 2025-05-07T19:48:03.3236575Z 2025-05-07T19:48:03.3236584Z 2025-05-07T19:48:03.3236592Z 2025-05-07T19:48:03.3236600Z 2025-05-07T19:48:03.3236607Z 2025-05-07T19:48:03.3236612Z 2025-05-07T19:48:03.3236618Z 2025-05-07T19:48:03.3236624Z 2025-05-07T19:48:03.3238092Z font-ttf-inconsolata | 94 KB | | 0%  2025-05-07T19:48:03.3238656Z 2025-05-07T19:48:03.3238662Z 2025-05-07T19:48:03.3238668Z 2025-05-07T19:48:03.3238673Z 2025-05-07T19:48:03.3238679Z 2025-05-07T19:48:03.3238684Z 2025-05-07T19:48:03.3238690Z 2025-05-07T19:48:03.3238695Z 2025-05-07T19:48:03.3238701Z 2025-05-07T19:48:03.3238706Z 2025-05-07T19:48:03.3238712Z 2025-05-07T19:48:03.3238718Z 2025-05-07T19:48:03.3238735Z 2025-05-07T19:48:03.3238741Z 2025-05-07T19:48:03.3238747Z 2025-05-07T19:48:03.3238753Z 2025-05-07T19:48:03.3238759Z 2025-05-07T19:48:03.3238764Z 2025-05-07T19:48:03.3239850Z libxcb-1.17.0 | 387 KB | | 0%  2025-05-07T19:48:03.3240341Z 2025-05-07T19:48:03.3240348Z 2025-05-07T19:48:03.3240354Z 2025-05-07T19:48:03.3240360Z 2025-05-07T19:48:03.3240366Z 2025-05-07T19:48:03.3240372Z 2025-05-07T19:48:03.3240389Z 2025-05-07T19:48:03.3240406Z 2025-05-07T19:48:03.3240412Z 2025-05-07T19:48:03.3240417Z 2025-05-07T19:48:03.3240422Z 2025-05-07T19:48:03.3240427Z 2025-05-07T19:48:03.3240433Z 2025-05-07T19:48:03.3240438Z 2025-05-07T19:48:03.3240444Z 2025-05-07T19:48:03.3240449Z 2025-05-07T19:48:03.3240455Z 2025-05-07T19:48:03.3240460Z 2025-05-07T19:48:03.3240466Z 2025-05-07T19:48:03.7053898Z ... (more hidden) ... 2025-05-07T19:48:03.7237447Z xorg-libxt-1.3.1 | 371 KB | 4 | 4% 2025-05-07T19:48:03.7237739Z 2025-05-07T19:48:03.7609138Z libopenblas-0.3.29 | 5.6 MB | | 0%  2025-05-07T19:48:03.7609415Z 2025-05-07T19:48:03.7609419Z 2025-05-07T19:48:03.7609423Z 2025-05-07T19:48:03.7617321Z xorg-libxext-1.3.6 | 49 KB | ###2 | 33%  2025-05-07T19:48:03.7647839Z xorg-libxt-1.3.1 | 371 KB | ########## | 100% 2025-05-07T19:48:03.7648080Z 2025-05-07T19:48:03.7648084Z 2025-05-07T19:48:03.7648088Z 2025-05-07T19:48:03.7648113Z 2025-05-07T19:48:03.7694014Z graphite2-1.3.13 | 95 KB | #6 | 17%  2025-05-07T19:48:03.7694310Z 2025-05-07T19:48:03.7694315Z 2025-05-07T19:48:03.7694320Z 2025-05-07T19:48:03.7694325Z 2025-05-07T19:48:03.7707547Z 2025-05-07T19:48:03.7759737Z scikit-build-0.18.1 | 114 KB | #3 | 14%  2025-05-07T19:48:03.7760017Z 2025-05-07T19:48:03.7762534Z 2025-05-07T19:48:03.7990832Z libprotobuf-5.29.3 | 3.2 MB | | 0%  2025-05-07T19:48:03.7991428Z 2025-05-07T19:48:03.7991434Z 2025-05-07T19:48:03.7991439Z 2025-05-07T19:48:03.7991444Z 2025-05-07T19:48:03.7991450Z 2025-05-07T19:48:03.7993076Z 2025-05-07T19:48:03.8112528Z krb5-1.21.3 | 1.3 MB | 1 | 1%  2025-05-07T19:48:03.8112836Z 2025-05-07T19:48:03.8112842Z 2025-05-07T19:48:03.8112847Z 2025-05-07T19:48:03.8112861Z 2025-05-07T19:48:03.8112867Z 2025-05-07T19:48:03.8112872Z 2025-05-07T19:48:03.8116187Z 2025-05-07T19:48:03.8317994Z xorg-libxau-1.0.12 | 14 KB | ########## | 100%  2025-05-07T19:48:03.8318318Z 2025-05-07T19:48:03.8318323Z 2025-05-07T19:48:03.8318326Z 2025-05-07T19:48:03.8318330Z 2025-05-07T19:48:03.8318334Z 2025-05-07T19:48:03.8318338Z 2025-05-07T19:48:03.8318342Z 2025-05-07T19:48:03.8318345Z 2025-05-07T19:48:03.8479765Z libtiff-4.7.0 | 419 KB | 3 | 4%  2025-05-07T19:48:03.8480057Z 2025-05-07T19:48:03.8480063Z 2025-05-07T19:48:03.8480096Z 2025-05-07T19:48:03.8480116Z 2025-05-07T19:48:03.8480122Z 2025-05-07T19:48:03.8480127Z 2025-05-07T19:48:03.8480132Z 2025-05-07T19:48:03.8480137Z 2025-05-07T19:48:03.8480142Z 2025-05-07T19:48:03.8490750Z freetype-2.13.3 | 168 KB | 9 | 10%  2025-05-07T19:48:03.8491068Z 2025-05-07T19:48:03.8491074Z 2025-05-07T19:48:03.8491080Z 2025-05-07T19:48:03.8491086Z 2025-05-07T19:48:03.8495761Z graphite2-1.3.13 | 95 KB | ########## | 100%  2025-05-07T19:48:03.8496088Z 2025-05-07T19:48:03.8496094Z 2025-05-07T19:48:03.8496100Z 2025-05-07T19:48:03.8643371Z xorg-libxext-1.3.6 | 49 KB | ########## | 100%  2025-05-07T19:48:03.8643692Z 2025-05-07T19:48:03.8643698Z 2025-05-07T19:48:03.8643706Z 2025-05-07T19:48:03.8643712Z 2025-05-07T19:48:03.8643718Z 2025-05-07T19:48:03.8643725Z 2025-05-07T19:48:03.8643733Z 2025-05-07T19:48:03.8643740Z 2025-05-07T19:48:03.8643748Z 2025-05-07T19:48:03.8645500Z 2025-05-07T19:48:03.8760058Z bazel-7.5.0 | 47.4 MB | | 0%  2025-05-07T19:48:03.8760421Z 2025-05-07T19:48:03.8766306Z 2025-05-07T19:48:03.8937634Z libprotobuf-5.29.3 | 3.2 MB | ######9 | 70%  2025-05-07T19:48:03.8938018Z 2025-05-07T19:48:03.8938025Z 2025-05-07T19:48:03.8938030Z 2025-05-07T19:48:03.8938035Z 2025-05-07T19:48:03.8938040Z 2025-05-07T19:48:03.8938044Z 2025-05-07T19:48:03.8938048Z 2025-05-07T19:48:03.8938052Z 2025-05-07T19:48:03.8938055Z 2025-05-07T19:48:03.8938087Z 2025-05-07T19:48:03.8938092Z 2025-05-07T19:48:03.8938095Z 2025-05-07T19:48:03.8938099Z 2025-05-07T19:48:03.9074781Z click-8.1.8 | 83 KB | #9 | 19%  2025-05-07T19:48:03.9075185Z 2025-05-07T19:48:03.9075192Z 2025-05-07T19:48:03.9075197Z 2025-05-07T19:48:03.9075202Z 2025-05-07T19:48:03.9075207Z 2025-05-07T19:48:03.9075213Z 2025-05-07T19:48:03.9075218Z 2025-05-07T19:48:03.9075224Z 2025-05-07T19:48:03.9075252Z 2025-05-07T19:48:03.9075258Z 2025-05-07T19:48:03.9076087Z 2025-05-07T19:48:03.9182689Z patchelf-0.18.0 | 133 KB | #2 | 12%  2025-05-07T19:48:03.9183042Z 2025-05-07T19:48:03.9183046Z 2025-05-07T19:48:03.9183050Z 2025-05-07T19:48:03.9183054Z 2025-05-07T19:48:03.9183057Z 2025-05-07T19:48:03.9183062Z 2025-05-07T19:48:03.9183066Z 2025-05-07T19:48:03.9274804Z xorg-libxau-1.0.12 | 14 KB | ########## | 100%  2025-05-07T19:48:03.9275106Z 2025-05-07T19:48:03.9275131Z 2025-05-07T19:48:03.9275136Z 2025-05-07T19:48:03.9275140Z 2025-05-07T19:48:03.9275144Z 2025-05-07T19:48:03.9275148Z 2025-05-07T19:48:03.9275152Z 2025-05-07T19:48:03.9275156Z 2025-05-07T19:48:03.9275168Z 2025-05-07T19:48:03.9275172Z 2025-05-07T19:48:03.9275176Z 2025-05-07T19:48:03.9275179Z 2025-05-07T19:48:03.9400301Z cairo-1.18.4 | 955 KB | 1 | 2%  2025-05-07T19:48:03.9400616Z 2025-05-07T19:48:03.9400904Z 2025-05-07T19:48:03.9400910Z 2025-05-07T19:48:03.9400915Z 2025-05-07T19:48:03.9400934Z 2025-05-07T19:48:03.9400940Z 2025-05-07T19:48:03.9400945Z 2025-05-07T19:48:03.9400951Z 2025-05-07T19:48:03.9400956Z 2025-05-07T19:48:03.9400961Z 2025-05-07T19:48:03.9400966Z 2025-05-07T19:48:03.9400971Z 2025-05-07T19:48:03.9400987Z 2025-05-07T19:48:03.9400992Z 2025-05-07T19:48:03.9472079Z libfreetype-2.13.3 | 8 KB | ########## | 100%  2025-05-07T19:48:03.9472454Z 2025-05-07T19:48:03.9472749Z 2025-05-07T19:48:03.9472757Z 2025-05-07T19:48:03.9472763Z 2025-05-07T19:48:03.9472768Z 2025-05-07T19:48:03.9472773Z 2025-05-07T19:48:03.9472778Z 2025-05-07T19:48:03.9472783Z 2025-05-07T19:48:03.9472788Z 2025-05-07T19:48:03.9472794Z 2025-05-07T19:48:03.9472799Z 2025-05-07T19:48:03.9472805Z 2025-05-07T19:48:03.9472810Z 2025-05-07T19:48:03.9472815Z 2025-05-07T19:48:03.9478289Z 2025-05-07T19:48:03.9646615Z distro-1.9.0 | 41 KB | ###9 | 39%  2025-05-07T19:48:03.9647072Z 2025-05-07T19:48:03.9647079Z 2025-05-07T19:48:03.9647084Z 2025-05-07T19:48:03.9647090Z 2025-05-07T19:48:03.9647095Z 2025-05-07T19:48:03.9647100Z 2025-05-07T19:48:03.9647106Z 2025-05-07T19:48:03.9647111Z 2025-05-07T19:48:03.9647117Z 2025-05-07T19:48:03.9647229Z 2025-05-07T19:48:03.9691177Z bazel-7.5.0 | 47.4 MB | 6 | 6%  2025-05-07T19:48:03.9691482Z 2025-05-07T19:48:03.9691488Z 2025-05-07T19:48:03.9691508Z 2025-05-07T19:48:03.9691514Z 2025-05-07T19:48:03.9691520Z 2025-05-07T19:48:03.9691525Z 2025-05-07T19:48:03.9691530Z 2025-05-07T19:48:03.9691536Z 2025-05-07T19:48:03.9691541Z 2025-05-07T19:48:03.9691546Z 2025-05-07T19:48:03.9691564Z 2025-05-07T19:48:03.9691569Z 2025-05-07T19:48:03.9691574Z 2025-05-07T19:48:03.9691579Z 2025-05-07T19:48:03.9691585Z 2025-05-07T19:48:03.9692481Z 2025-05-07T19:48:03.9714729Z libgcc-15.1.0 | 810 KB | 1 | 2%  2025-05-07T19:48:03.9715063Z 2025-05-07T19:48:03.9715070Z 2025-05-07T19:48:03.9715074Z 2025-05-07T19:48:03.9715077Z 2025-05-07T19:48:03.9715081Z 2025-05-07T19:48:03.9715085Z 2025-05-07T19:48:03.9715088Z 2025-05-07T19:48:03.9715092Z 2025-05-07T19:48:03.9715096Z 2025-05-07T19:48:03.9715099Z 2025-05-07T19:48:03.9715103Z 2025-05-07T19:48:03.9715106Z 2025-05-07T19:48:03.9715110Z 2025-05-07T19:48:03.9715114Z 2025-05-07T19:48:03.9715117Z 2025-05-07T19:48:03.9715121Z 2025-05-07T19:48:03.9735636Z 2025-05-07T19:48:03.9958804Z font-ttf-inconsolata | 94 KB | #6 | 17%  2025-05-07T19:48:03.9959170Z 2025-05-07T19:48:03.9959174Z 2025-05-07T19:48:03.9959178Z 2025-05-07T19:48:03.9959182Z 2025-05-07T19:48:03.9959185Z 2025-05-07T19:48:03.9959191Z 2025-05-07T19:48:03.9959198Z 2025-05-07T19:48:03.9959207Z 2025-05-07T19:48:03.9959212Z 2025-05-07T19:48:03.9959218Z 2025-05-07T19:48:03.9959251Z 2025-05-07T19:48:03.9959256Z 2025-05-07T19:48:03.9959261Z 2025-05-07T19:48:03.9959277Z 2025-05-07T19:48:03.9959282Z 2025-05-07T19:48:03.9959288Z 2025-05-07T19:48:03.9959293Z 2025-05-07T19:48:03.9959298Z 2025-05-07T19:48:04.0102926Z libxcb-1.17.0 | 387 KB | 4 | 4%  2025-05-07T19:48:04.0103252Z 2025-05-07T19:48:04.0103265Z 2025-05-07T19:48:04.0103269Z 2025-05-07T19:48:04.0103273Z 2025-05-07T19:48:04.0103277Z 2025-05-07T19:48:04.0103280Z 2025-05-07T19:48:04.0103308Z 2025-05-07T19:48:04.0103312Z 2025-05-07T19:48:04.0103316Z 2025-05-07T19:48:04.0103319Z 2025-05-07T19:48:04.0103323Z 2025-05-07T19:48:04.0103327Z 2025-05-07T19:48:04.0103331Z 2025-05-07T19:48:04.0103334Z 2025-05-07T19:48:04.0103338Z 2025-05-07T19:48:04.0103342Z 2025-05-07T19:48:04.0103345Z 2025-05-07T19:48:04.0103349Z 2025-05-07T19:48:04.0105361Z 2025-05-07T19:48:04.0441500Z ... (more hidden) ... 2025-05-07T19:48:04.0442103Z 2025-05-07T19:48:04.0442108Z 2025-05-07T19:48:04.0442125Z 2025-05-07T19:48:04.0442129Z 2025-05-07T19:48:04.0442133Z 2025-05-07T19:48:04.0442137Z 2025-05-07T19:48:04.0442141Z 2025-05-07T19:48:04.0442145Z 2025-05-07T19:48:04.0447799Z libtiff-4.7.0 | 419 KB | ########## | 100%  2025-05-07T19:48:04.0448210Z 2025-05-07T19:48:04.0448224Z 2025-05-07T19:48:04.0448230Z 2025-05-07T19:48:04.0448235Z 2025-05-07T19:48:04.0448240Z 2025-05-07T19:48:04.0448246Z 2025-05-07T19:48:04.0448485Z 2025-05-07T19:48:04.0448803Z 2025-05-07T19:48:04.0648076Z libtiff-4.7.0 | 419 KB | ########## | 100%  2025-05-07T19:48:04.0648376Z 2025-05-07T19:48:04.0648380Z 2025-05-07T19:48:04.0648384Z 2025-05-07T19:48:04.0648387Z 2025-05-07T19:48:04.0648391Z 2025-05-07T19:48:04.0648395Z 2025-05-07T19:48:04.0648399Z 2025-05-07T19:48:04.0648402Z 2025-05-07T19:48:04.0648407Z 2025-05-07T19:48:04.0650003Z 2025-05-07T19:48:04.1652521Z bazel-7.5.0 | 47.4 MB | #3 | 13%  2025-05-07T19:48:04.1652928Z 2025-05-07T19:48:04.1652932Z 2025-05-07T19:48:04.1652936Z 2025-05-07T19:48:04.1652940Z 2025-05-07T19:48:04.1652944Z 2025-05-07T19:48:04.1652948Z 2025-05-07T19:48:04.1652951Z 2025-05-07T19:48:04.1652955Z 2025-05-07T19:48:04.1652959Z 2025-05-07T19:48:04.1652963Z 2025-05-07T19:48:04.2556949Z bazel-7.5.0 | 47.4 MB | #8 | 19%  2025-05-07T19:48:04.2557241Z 2025-05-07T19:48:04.2557267Z 2025-05-07T19:48:04.2557271Z 2025-05-07T19:48:04.2557275Z 2025-05-07T19:48:04.2557278Z 2025-05-07T19:48:04.2557282Z 2025-05-07T19:48:04.2557286Z 2025-05-07T19:48:04.2557291Z 2025-05-07T19:48:04.2561855Z 2025-05-07T19:48:04.2574253Z freetype-2.13.3 | 168 KB | ########## | 100%  2025-05-07T19:48:04.2574549Z 2025-05-07T19:48:04.2574553Z 2025-05-07T19:48:04.2574557Z 2025-05-07T19:48:04.2574561Z 2025-05-07T19:48:04.2574579Z 2025-05-07T19:48:04.2574583Z 2025-05-07T19:48:04.2574587Z 2025-05-07T19:48:04.2574590Z 2025-05-07T19:48:04.2576713Z 2025-05-07T19:48:04.2814664Z freetype-2.13.3 | 168 KB | ########## | 100%  2025-05-07T19:48:04.2814949Z 2025-05-07T19:48:04.2814953Z 2025-05-07T19:48:04.2814957Z 2025-05-07T19:48:04.2814961Z 2025-05-07T19:48:04.2814965Z 2025-05-07T19:48:04.2814969Z 2025-05-07T19:48:04.2814973Z 2025-05-07T19:48:04.2814976Z 2025-05-07T19:48:04.2814980Z 2025-05-07T19:48:04.2816620Z 2025-05-07T19:48:04.4022156Z bazel-7.5.0 | 47.4 MB | ##4 | 24%  2025-05-07T19:48:04.4022630Z 2025-05-07T19:48:04.4022637Z 2025-05-07T19:48:04.4022643Z 2025-05-07T19:48:04.4022648Z 2025-05-07T19:48:04.4022654Z 2025-05-07T19:48:04.4022660Z 2025-05-07T19:48:04.4022666Z 2025-05-07T19:48:04.4022672Z 2025-05-07T19:48:04.4022677Z 2025-05-07T19:48:04.4022721Z 2025-05-07T19:48:04.5054277Z bazel-7.5.0 | 47.4 MB | ##9 | 29%  2025-05-07T19:48:04.5054696Z 2025-05-07T19:48:04.5054701Z 2025-05-07T19:48:04.5054705Z 2025-05-07T19:48:04.5054708Z 2025-05-07T19:48:04.5054712Z 2025-05-07T19:48:04.5054716Z 2025-05-07T19:48:04.5054721Z 2025-05-07T19:48:04.5054725Z 2025-05-07T19:48:04.5054741Z 2025-05-07T19:48:04.5055083Z 2025-05-07T19:48:04.5937446Z bazel-7.5.0 | 47.4 MB | ###4 | 34%  2025-05-07T19:48:04.5937752Z 2025-05-07T19:48:04.5937757Z 2025-05-07T19:48:04.5937783Z 2025-05-07T19:48:04.5937787Z 2025-05-07T19:48:04.5937791Z 2025-05-07T19:48:04.5949433Z scikit-build-0.18.1 | 114 KB | ########## | 100%  2025-05-07T19:48:04.5949709Z 2025-05-07T19:48:04.5949713Z 2025-05-07T19:48:04.5949717Z 2025-05-07T19:48:04.5949721Z 2025-05-07T19:48:04.5958836Z 2025-05-07T19:48:04.6061088Z scikit-build-0.18.1 | 114 KB | ########## | 100%  2025-05-07T19:48:04.6061385Z 2025-05-07T19:48:04.6061389Z 2025-05-07T19:48:04.6061653Z 2025-05-07T19:48:04.6061658Z 2025-05-07T19:48:04.6061664Z 2025-05-07T19:48:04.6061669Z 2025-05-07T19:48:04.6061674Z 2025-05-07T19:48:04.6061679Z 2025-05-07T19:48:04.6061685Z 2025-05-07T19:48:04.6063206Z 2025-05-07T19:48:04.7146954Z bazel-7.5.0 | 47.4 MB | ###9 | 39%  2025-05-07T19:48:04.7147249Z 2025-05-07T19:48:04.7147254Z 2025-05-07T19:48:04.7147257Z 2025-05-07T19:48:04.7147261Z 2025-05-07T19:48:04.7147266Z 2025-05-07T19:48:04.7147271Z 2025-05-07T19:48:04.7147516Z 2025-05-07T19:48:04.7147523Z 2025-05-07T19:48:04.7147527Z 2025-05-07T19:48:04.7149799Z 2025-05-07T19:48:04.7167407Z bazel-7.5.0 | 47.4 MB | ####4 | 44%  2025-05-07T19:48:04.7167742Z 2025-05-07T19:48:04.7167747Z 2025-05-07T19:48:04.7167751Z 2025-05-07T19:48:04.7167755Z 2025-05-07T19:48:04.7167759Z 2025-05-07T19:48:04.7167763Z 2025-05-07T19:48:04.7167767Z 2025-05-07T19:48:04.7167770Z 2025-05-07T19:48:04.7167774Z 2025-05-07T19:48:04.7167792Z 2025-05-07T19:48:04.7167797Z 2025-05-07T19:48:04.7167800Z 2025-05-07T19:48:04.7175520Z 2025-05-07T19:48:04.7201468Z click-8.1.8 | 83 KB | ########## | 100%  2025-05-07T19:48:04.7201829Z 2025-05-07T19:48:04.7201835Z 2025-05-07T19:48:04.7201840Z 2025-05-07T19:48:04.7201846Z 2025-05-07T19:48:04.7201851Z 2025-05-07T19:48:04.7201856Z 2025-05-07T19:48:04.7201861Z 2025-05-07T19:48:04.7201867Z 2025-05-07T19:48:04.7201872Z 2025-05-07T19:48:04.7201893Z 2025-05-07T19:48:04.7201899Z 2025-05-07T19:48:04.7201914Z 2025-05-07T19:48:04.7201926Z 2025-05-07T19:48:04.8221586Z click-8.1.8 | 83 KB | ########## | 100%  2025-05-07T19:48:04.8222014Z 2025-05-07T19:48:04.8222021Z 2025-05-07T19:48:04.8222027Z 2025-05-07T19:48:04.8222032Z 2025-05-07T19:48:04.8222047Z 2025-05-07T19:48:04.8222054Z 2025-05-07T19:48:04.8222060Z 2025-05-07T19:48:04.8222065Z 2025-05-07T19:48:04.8222070Z 2025-05-07T19:48:04.8232627Z 2025-05-07T19:48:04.8698982Z bazel-7.5.0 | 47.4 MB | ####8 | 49%  2025-05-07T19:48:04.8699262Z 2025-05-07T19:48:04.8699266Z 2025-05-07T19:48:04.8699279Z 2025-05-07T19:48:04.8699283Z 2025-05-07T19:48:04.8699286Z 2025-05-07T19:48:04.8699290Z 2025-05-07T19:48:04.8730354Z krb5-1.21.3 | 1.3 MB | ########## | 100%  2025-05-07T19:48:04.8730613Z 2025-05-07T19:48:04.8730617Z 2025-05-07T19:48:04.8730621Z 2025-05-07T19:48:04.8730646Z 2025-05-07T19:48:04.8730658Z 2025-05-07T19:48:04.8751553Z 2025-05-07T19:48:04.8764159Z krb5-1.21.3 | 1.3 MB | ########## | 100%  2025-05-07T19:48:04.8767395Z 2025-05-07T19:48:04.8778849Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%  2025-05-07T19:48:04.8780525Z 2025-05-07T19:48:04.9286410Z libopenblas-0.3.29 | 5.6 MB | ########## | 100%  2025-05-07T19:48:04.9286776Z 2025-05-07T19:48:04.9286782Z 2025-05-07T19:48:04.9286788Z 2025-05-07T19:48:04.9286818Z 2025-05-07T19:48:04.9286824Z 2025-05-07T19:48:04.9286829Z 2025-05-07T19:48:04.9286834Z 2025-05-07T19:48:04.9286840Z 2025-05-07T19:48:04.9286845Z 2025-05-07T19:48:04.9287059Z 2025-05-07T19:48:04.9410361Z bazel-7.5.0 | 47.4 MB | #####3 | 53%  2025-05-07T19:48:04.9410732Z 2025-05-07T19:48:04.9410737Z 2025-05-07T19:48:04.9410743Z 2025-05-07T19:48:04.9410748Z 2025-05-07T19:48:04.9410753Z 2025-05-07T19:48:04.9410759Z 2025-05-07T19:48:04.9410764Z 2025-05-07T19:48:04.9410784Z 2025-05-07T19:48:04.9410790Z 2025-05-07T19:48:04.9410795Z 2025-05-07T19:48:04.9410800Z 2025-05-07T19:48:04.9410805Z 2025-05-07T19:48:04.9410811Z 2025-05-07T19:48:04.9413023Z 2025-05-07T19:48:04.9512553Z libfreetype-2.13.3 | 8 KB | ########## | 100%  2025-05-07T19:48:04.9512953Z 2025-05-07T19:48:04.9512958Z 2025-05-07T19:48:04.9512963Z 2025-05-07T19:48:04.9512969Z 2025-05-07T19:48:04.9512974Z 2025-05-07T19:48:04.9513279Z 2025-05-07T19:48:04.9513283Z 2025-05-07T19:48:04.9513286Z 2025-05-07T19:48:04.9513290Z 2025-05-07T19:48:04.9513294Z 2025-05-07T19:48:04.9515674Z 2025-05-07T19:48:04.9522263Z patchelf-0.18.0 | 133 KB | ########## | 100%  2025-05-07T19:48:04.9522569Z 2025-05-07T19:48:04.9522573Z 2025-05-07T19:48:04.9522577Z 2025-05-07T19:48:04.9522580Z 2025-05-07T19:48:04.9522584Z 2025-05-07T19:48:04.9522597Z 2025-05-07T19:48:04.9522600Z 2025-05-07T19:48:04.9522604Z 2025-05-07T19:48:04.9522817Z 2025-05-07T19:48:04.9522822Z 2025-05-07T19:48:04.9522826Z 2025-05-07T19:48:05.0342631Z patchelf-0.18.0 | 133 KB | ########## | 100%  2025-05-07T19:48:05.0342931Z 2025-05-07T19:48:05.0342935Z 2025-05-07T19:48:05.0342947Z 2025-05-07T19:48:05.0342951Z 2025-05-07T19:48:05.0342955Z 2025-05-07T19:48:05.0342959Z 2025-05-07T19:48:05.0342962Z 2025-05-07T19:48:05.0342966Z 2025-05-07T19:48:05.0342970Z 2025-05-07T19:48:05.0344496Z 2025-05-07T19:48:05.1422695Z bazel-7.5.0 | 47.4 MB | #####7 | 58%  2025-05-07T19:48:05.1423109Z 2025-05-07T19:48:05.1423115Z 2025-05-07T19:48:05.1423120Z 2025-05-07T19:48:05.1423125Z 2025-05-07T19:48:05.1423130Z 2025-05-07T19:48:05.1423135Z 2025-05-07T19:48:05.1423140Z 2025-05-07T19:48:05.1423145Z 2025-05-07T19:48:05.1423150Z 2025-05-07T19:48:05.1423155Z 2025-05-07T19:48:05.2425684Z bazel-7.5.0 | 47.4 MB | ######2 | 62%  2025-05-07T19:48:05.2425978Z 2025-05-07T19:48:05.2425983Z 2025-05-07T19:48:05.2425986Z 2025-05-07T19:48:05.2425990Z 2025-05-07T19:48:05.2425994Z 2025-05-07T19:48:05.2425997Z 2025-05-07T19:48:05.2426001Z 2025-05-07T19:48:05.2426005Z 2025-05-07T19:48:05.2426008Z 2025-05-07T19:48:05.2429886Z 2025-05-07T19:48:05.3110047Z bazel-7.5.0 | 47.4 MB | ######6 | 67%  2025-05-07T19:48:05.3110382Z 2025-05-07T19:48:05.3110389Z 2025-05-07T19:48:05.3110421Z 2025-05-07T19:48:05.3110425Z 2025-05-07T19:48:05.3110428Z 2025-05-07T19:48:05.3110432Z 2025-05-07T19:48:05.3110436Z 2025-05-07T19:48:05.3110440Z 2025-05-07T19:48:05.3110443Z 2025-05-07T19:48:05.3110447Z 2025-05-07T19:48:05.3110451Z 2025-05-07T19:48:05.3115258Z 2025-05-07T19:48:05.3124446Z cairo-1.18.4 | 955 KB | ########## | 100%  2025-05-07T19:48:05.3124750Z 2025-05-07T19:48:05.3124754Z 2025-05-07T19:48:05.3124758Z 2025-05-07T19:48:05.3124762Z 2025-05-07T19:48:05.3124777Z 2025-05-07T19:48:05.3124781Z 2025-05-07T19:48:05.3124785Z 2025-05-07T19:48:05.3124789Z 2025-05-07T19:48:05.3124792Z 2025-05-07T19:48:05.3124804Z 2025-05-07T19:48:05.3124808Z 2025-05-07T19:48:05.3126818Z 2025-05-07T19:48:05.3427668Z cairo-1.18.4 | 955 KB | ########## | 100%  2025-05-07T19:48:05.3428022Z 2025-05-07T19:48:05.3428028Z 2025-05-07T19:48:05.3428034Z 2025-05-07T19:48:05.3428039Z 2025-05-07T19:48:05.3428074Z 2025-05-07T19:48:05.3428079Z 2025-05-07T19:48:05.3428085Z 2025-05-07T19:48:05.3428090Z 2025-05-07T19:48:05.3428095Z 2025-05-07T19:48:05.3430959Z 2025-05-07T19:48:05.4009518Z bazel-7.5.0 | 47.4 MB | #######1 | 71%  2025-05-07T19:48:05.4009834Z 2025-05-07T19:48:05.4009838Z 2025-05-07T19:48:05.4009841Z 2025-05-07T19:48:05.4009845Z 2025-05-07T19:48:05.4009849Z 2025-05-07T19:48:05.4009853Z 2025-05-07T19:48:05.4009856Z 2025-05-07T19:48:05.4009860Z 2025-05-07T19:48:05.4009888Z 2025-05-07T19:48:05.4009895Z 2025-05-07T19:48:05.4009900Z 2025-05-07T19:48:05.4009905Z 2025-05-07T19:48:05.4009910Z 2025-05-07T19:48:05.4009916Z 2025-05-07T19:48:05.4009921Z 2025-05-07T19:48:05.4009927Z 2025-05-07T19:48:05.4009932Z 2025-05-07T19:48:05.4037490Z font-ttf-inconsolata | 94 KB | ########## | 100%  2025-05-07T19:48:05.4037985Z 2025-05-07T19:48:05.4037991Z 2025-05-07T19:48:05.4037996Z 2025-05-07T19:48:05.4038271Z 2025-05-07T19:48:05.4038276Z 2025-05-07T19:48:05.4038281Z 2025-05-07T19:48:05.4038286Z 2025-05-07T19:48:05.4038291Z 2025-05-07T19:48:05.4038297Z 2025-05-07T19:48:05.4038302Z 2025-05-07T19:48:05.4038307Z 2025-05-07T19:48:05.4038313Z 2025-05-07T19:48:05.4038318Z 2025-05-07T19:48:05.4038333Z 2025-05-07T19:48:05.4038339Z 2025-05-07T19:48:05.4038344Z 2025-05-07T19:48:05.4038349Z 2025-05-07T19:48:05.4550391Z font-ttf-inconsolata | 94 KB | ########## | 100%  2025-05-07T19:48:05.4550743Z 2025-05-07T19:48:05.4550747Z 2025-05-07T19:48:05.4550759Z 2025-05-07T19:48:05.4550775Z 2025-05-07T19:48:05.4550781Z 2025-05-07T19:48:05.4550786Z 2025-05-07T19:48:05.4550791Z 2025-05-07T19:48:05.4550796Z 2025-05-07T19:48:05.4550802Z 2025-05-07T19:48:05.4550807Z 2025-05-07T19:48:05.5593100Z bazel-7.5.0 | 47.4 MB | #######5 | 76%  2025-05-07T19:48:05.5593505Z 2025-05-07T19:48:05.5593512Z 2025-05-07T19:48:05.5593543Z 2025-05-07T19:48:05.5593547Z 2025-05-07T19:48:05.5593551Z 2025-05-07T19:48:05.5593555Z 2025-05-07T19:48:05.5593558Z 2025-05-07T19:48:05.5593562Z 2025-05-07T19:48:05.5593567Z 2025-05-07T19:48:05.5593570Z 2025-05-07T19:48:05.6598714Z bazel-7.5.0 | 47.4 MB | ######## | 80%  2025-05-07T19:48:05.6599115Z 2025-05-07T19:48:05.6599122Z 2025-05-07T19:48:05.6599127Z 2025-05-07T19:48:05.6599132Z 2025-05-07T19:48:05.6599137Z 2025-05-07T19:48:05.6599143Z 2025-05-07T19:48:05.6599178Z 2025-05-07T19:48:05.6599184Z 2025-05-07T19:48:05.6599189Z 2025-05-07T19:48:05.6600650Z 2025-05-07T19:48:05.6890001Z bazel-7.5.0 | 47.4 MB | ########4 | 85%  2025-05-07T19:48:05.6890312Z 2025-05-07T19:48:05.6890316Z 2025-05-07T19:48:05.6890320Z 2025-05-07T19:48:05.6890324Z 2025-05-07T19:48:05.6890328Z 2025-05-07T19:48:05.6890332Z 2025-05-07T19:48:05.6890336Z 2025-05-07T19:48:05.6890340Z 2025-05-07T19:48:05.6890343Z 2025-05-07T19:48:05.6890368Z 2025-05-07T19:48:05.6890372Z 2025-05-07T19:48:05.6890376Z 2025-05-07T19:48:05.6890379Z 2025-05-07T19:48:05.6890383Z 2025-05-07T19:48:05.6890977Z 2025-05-07T19:48:05.6896343Z distro-1.9.0 | 41 KB | ########## | 100%  2025-05-07T19:48:05.6896718Z 2025-05-07T19:48:05.6896722Z 2025-05-07T19:48:05.6896725Z 2025-05-07T19:48:05.6896729Z 2025-05-07T19:48:05.6896733Z 2025-05-07T19:48:05.6896736Z 2025-05-07T19:48:05.6896740Z 2025-05-07T19:48:05.6896755Z 2025-05-07T19:48:05.6896768Z 2025-05-07T19:48:05.6896772Z 2025-05-07T19:48:05.6896776Z 2025-05-07T19:48:05.6896779Z 2025-05-07T19:48:05.6896783Z 2025-05-07T19:48:05.6896787Z 2025-05-07T19:48:05.6896791Z 2025-05-07T19:48:05.6943679Z distro-1.9.0 | 41 KB | ########## | 100%  2025-05-07T19:48:05.6943981Z 2025-05-07T19:48:05.6943985Z 2025-05-07T19:48:05.6943989Z 2025-05-07T19:48:05.6943993Z 2025-05-07T19:48:05.6944007Z 2025-05-07T19:48:05.6944011Z 2025-05-07T19:48:05.6944014Z 2025-05-07T19:48:05.6944018Z 2025-05-07T19:48:05.6944022Z 2025-05-07T19:48:05.6944025Z 2025-05-07T19:48:05.6944029Z 2025-05-07T19:48:05.6944032Z 2025-05-07T19:48:05.6944036Z 2025-05-07T19:48:05.6944040Z 2025-05-07T19:48:05.6944043Z 2025-05-07T19:48:05.6947329Z 2025-05-07T19:48:05.6974652Z libgcc-15.1.0 | 810 KB | ########## | 100%  2025-05-07T19:48:05.6975094Z 2025-05-07T19:48:05.6975101Z 2025-05-07T19:48:05.6975118Z 2025-05-07T19:48:05.6975124Z 2025-05-07T19:48:05.6975130Z 2025-05-07T19:48:05.6975135Z 2025-05-07T19:48:05.6975140Z 2025-05-07T19:48:05.6975146Z 2025-05-07T19:48:05.6975151Z 2025-05-07T19:48:05.6975156Z 2025-05-07T19:48:05.6975162Z 2025-05-07T19:48:05.6975167Z 2025-05-07T19:48:05.6975172Z 2025-05-07T19:48:05.6975177Z 2025-05-07T19:48:05.6975182Z 2025-05-07T19:48:05.6976425Z 2025-05-07T19:48:05.7750703Z libgcc-15.1.0 | 810 KB | ########## | 100%  2025-05-07T19:48:05.7751281Z 2025-05-07T19:48:05.7751285Z 2025-05-07T19:48:05.7751289Z 2025-05-07T19:48:05.7751293Z 2025-05-07T19:48:05.7751298Z 2025-05-07T19:48:05.7751301Z 2025-05-07T19:48:05.7751306Z 2025-05-07T19:48:05.7751310Z 2025-05-07T19:48:05.7751314Z 2025-05-07T19:48:05.7751317Z 2025-05-07T19:48:05.8539274Z bazel-7.5.0 | 47.4 MB | ########9 | 89%  2025-05-07T19:48:05.8539686Z 2025-05-07T19:48:05.8539690Z 2025-05-07T19:48:05.8539945Z 2025-05-07T19:48:05.8539950Z 2025-05-07T19:48:05.8539954Z 2025-05-07T19:48:05.8539958Z 2025-05-07T19:48:05.8539962Z 2025-05-07T19:48:05.8539966Z 2025-05-07T19:48:05.8539970Z 2025-05-07T19:48:05.8539975Z 2025-05-07T19:48:05.8539978Z 2025-05-07T19:48:05.8539982Z 2025-05-07T19:48:05.8539986Z 2025-05-07T19:48:05.8539990Z 2025-05-07T19:48:05.8539993Z 2025-05-07T19:48:05.8539997Z 2025-05-07T19:48:05.8540001Z 2025-05-07T19:48:05.8540020Z 2025-05-07T19:48:05.8540033Z 2025-05-07T19:48:05.8548185Z ... (more hidden) ... 2025-05-07T19:48:05.8548483Z 2025-05-07T19:48:05.8548488Z 2025-05-07T19:48:05.8548492Z 2025-05-07T19:48:05.8548495Z 2025-05-07T19:48:05.8548499Z 2025-05-07T19:48:05.8548503Z 2025-05-07T19:48:05.8548507Z 2025-05-07T19:48:05.8548510Z 2025-05-07T19:48:05.8548514Z 2025-05-07T19:48:05.8548518Z 2025-05-07T19:48:05.8548521Z 2025-05-07T19:48:05.8548534Z 2025-05-07T19:48:05.8548538Z 2025-05-07T19:48:05.8548551Z 2025-05-07T19:48:05.8548555Z 2025-05-07T19:48:05.8548559Z 2025-05-07T19:48:05.8548562Z 2025-05-07T19:48:05.8548566Z 2025-05-07T19:48:05.8549234Z 2025-05-07T19:48:05.8829196Z ... (more hidden) ... 2025-05-07T19:48:05.8829485Z 2025-05-07T19:48:05.8829489Z 2025-05-07T19:48:05.8829493Z 2025-05-07T19:48:05.8829617Z 2025-05-07T19:48:05.8829947Z 2025-05-07T19:48:05.8829953Z 2025-05-07T19:48:05.8829963Z 2025-05-07T19:48:05.8829982Z 2025-05-07T19:48:05.8829988Z 2025-05-07T19:48:05.8830918Z 2025-05-07T19:48:05.9831252Z bazel-7.5.0 | 47.4 MB | #########3 | 93%  2025-05-07T19:48:05.9831568Z 2025-05-07T19:48:05.9831578Z 2025-05-07T19:48:05.9831590Z 2025-05-07T19:48:05.9831595Z 2025-05-07T19:48:05.9831599Z 2025-05-07T19:48:05.9831602Z 2025-05-07T19:48:05.9831606Z 2025-05-07T19:48:05.9831610Z 2025-05-07T19:48:05.9831614Z 2025-05-07T19:48:05.9831617Z 2025-05-07T19:48:06.0876361Z bazel-7.5.0 | 47.4 MB | #########8 | 98%  2025-05-07T19:48:06.0876810Z 2025-05-07T19:48:06.0876815Z 2025-05-07T19:48:06.0876819Z 2025-05-07T19:48:06.0876822Z 2025-05-07T19:48:06.0876826Z 2025-05-07T19:48:06.0876830Z 2025-05-07T19:48:06.0876833Z 2025-05-07T19:48:06.0876837Z 2025-05-07T19:48:06.0876841Z 2025-05-07T19:48:06.0876845Z 2025-05-07T19:48:06.0876848Z 2025-05-07T19:48:06.0876852Z 2025-05-07T19:48:06.0876856Z 2025-05-07T19:48:06.0876872Z 2025-05-07T19:48:06.0876876Z 2025-05-07T19:48:06.0876880Z 2025-05-07T19:48:06.0876883Z 2025-05-07T19:48:06.0879559Z 2025-05-07T19:48:06.0895154Z libxcb-1.17.0 | 387 KB | ########## | 100%  2025-05-07T19:48:06.0895487Z 2025-05-07T19:48:06.0895491Z 2025-05-07T19:48:06.0895495Z 2025-05-07T19:48:06.0895499Z 2025-05-07T19:48:06.0895503Z 2025-05-07T19:48:06.0895507Z 2025-05-07T19:48:06.0895519Z 2025-05-07T19:48:06.0895523Z 2025-05-07T19:48:06.0895526Z 2025-05-07T19:48:06.0895539Z 2025-05-07T19:48:06.0895543Z 2025-05-07T19:48:06.0895547Z 2025-05-07T19:48:06.0895551Z 2025-05-07T19:48:06.0895555Z 2025-05-07T19:48:06.0895558Z 2025-05-07T19:48:06.0895562Z 2025-05-07T19:48:06.0895566Z 2025-05-07T19:48:06.0895570Z 2025-05-07T19:48:06.4166550Z libxcb-1.17.0 | 387 KB | ########## | 100%  2025-05-07T19:48:06.4166980Z 2025-05-07T19:48:06.4168271Z 2025-05-07T19:48:11.8907496Z libprotobuf-5.29.3 | 3.2 MB | ########## | 100%  2025-05-07T19:48:11.8907895Z 2025-05-07T19:48:11.8907901Z 2025-05-07T19:48:11.8907907Z 2025-05-07T19:48:11.8907912Z 2025-05-07T19:48:11.8907917Z 2025-05-07T19:48:11.8907936Z 2025-05-07T19:48:11.8907944Z 2025-05-07T19:48:11.8907949Z 2025-05-07T19:48:11.8907954Z 2025-05-07T19:48:11.8907959Z 2025-05-07T19:48:14.2553726Z bazel-7.5.0 | 47.4 MB | ########## | 100%  2025-05-07T19:48:14.2554152Z 2025-05-07T19:48:14.2554441Z 2025-05-07T19:48:14.2554447Z 2025-05-07T19:48:14.2554450Z 2025-05-07T19:48:14.2554454Z 2025-05-07T19:48:14.2554458Z 2025-05-07T19:48:14.2554462Z 2025-05-07T19:48:14.2554467Z 2025-05-07T19:48:14.2554471Z 2025-05-07T19:48:14.2554475Z 2025-05-07T19:48:14.2554479Z 2025-05-07T19:48:14.2554501Z 2025-05-07T19:48:14.2554506Z 2025-05-07T19:48:14.2554510Z 2025-05-07T19:48:14.2554515Z 2025-05-07T19:48:14.2554520Z 2025-05-07T19:48:14.2554527Z 2025-05-07T19:48:14.2554532Z 2025-05-07T19:48:14.2554552Z 2025-05-07T19:48:14.2554668Z 2025-05-07T19:48:14.2555108Z  2025-05-07T19:48:14.2555407Z 2025-05-07T19:48:14.2555605Z 2025-05-07T19:48:14.2555769Z  2025-05-07T19:48:14.2555959Z 2025-05-07T19:48:14.2555964Z 2025-05-07T19:48:14.2556118Z  2025-05-07T19:48:14.2556328Z 2025-05-07T19:48:14.2556332Z 2025-05-07T19:48:14.2556335Z 2025-05-07T19:48:14.2556497Z  2025-05-07T19:48:14.2556747Z 2025-05-07T19:48:14.2556763Z 2025-05-07T19:48:14.2556768Z 2025-05-07T19:48:14.2556775Z 2025-05-07T19:48:14.2556998Z  2025-05-07T19:48:14.2557285Z 2025-05-07T19:48:14.2557289Z 2025-05-07T19:48:14.2557293Z 2025-05-07T19:48:14.2557304Z 2025-05-07T19:48:14.2557307Z 2025-05-07T19:48:14.2557558Z  2025-05-07T19:48:14.2557844Z 2025-05-07T19:48:14.2557850Z 2025-05-07T19:48:14.2557855Z 2025-05-07T19:48:14.2557861Z 2025-05-07T19:48:14.2557866Z 2025-05-07T19:48:14.2557872Z 2025-05-07T19:48:14.2558066Z  2025-05-07T19:48:14.2558354Z 2025-05-07T19:48:14.2558359Z 2025-05-07T19:48:14.2558365Z 2025-05-07T19:48:14.2558370Z 2025-05-07T19:48:14.2558385Z 2025-05-07T19:48:14.2558391Z 2025-05-07T19:48:14.2558396Z 2025-05-07T19:48:14.2558685Z  2025-05-07T19:48:14.2558968Z 2025-05-07T19:48:14.2558974Z 2025-05-07T19:48:14.2558979Z 2025-05-07T19:48:14.2558984Z 2025-05-07T19:48:14.2558989Z 2025-05-07T19:48:14.2558994Z 2025-05-07T19:48:14.2558999Z 2025-05-07T19:48:14.2559004Z 2025-05-07T19:48:14.2559248Z  2025-05-07T19:48:14.2559524Z 2025-05-07T19:48:14.2559527Z 2025-05-07T19:48:14.2559531Z 2025-05-07T19:48:14.2559535Z 2025-05-07T19:48:14.2559539Z 2025-05-07T19:48:14.2559542Z 2025-05-07T19:48:14.2559546Z 2025-05-07T19:48:14.2559549Z 2025-05-07T19:48:14.2559553Z 2025-05-07T19:48:14.2559755Z  2025-05-07T19:48:14.2560031Z 2025-05-07T19:48:14.2560035Z 2025-05-07T19:48:14.2560038Z 2025-05-07T19:48:14.2560051Z 2025-05-07T19:48:14.2560055Z 2025-05-07T19:48:14.2560058Z 2025-05-07T19:48:14.2560062Z 2025-05-07T19:48:14.2560066Z 2025-05-07T19:48:14.2560069Z 2025-05-07T19:48:14.2560073Z 2025-05-07T19:48:14.2560323Z  2025-05-07T19:48:14.2560617Z 2025-05-07T19:48:14.2560623Z 2025-05-07T19:48:14.2560628Z 2025-05-07T19:48:14.2560633Z 2025-05-07T19:48:14.2560639Z 2025-05-07T19:48:14.2560644Z 2025-05-07T19:48:14.2560875Z 2025-05-07T19:48:14.2560881Z 2025-05-07T19:48:14.2560886Z 2025-05-07T19:48:14.2560892Z 2025-05-07T19:48:14.2560897Z 2025-05-07T19:48:14.2561195Z  2025-05-07T19:48:14.2561490Z 2025-05-07T19:48:14.2561496Z 2025-05-07T19:48:14.2561501Z 2025-05-07T19:48:14.2561506Z 2025-05-07T19:48:14.2561511Z 2025-05-07T19:48:14.2561516Z 2025-05-07T19:48:14.2561522Z 2025-05-07T19:48:14.2561527Z 2025-05-07T19:48:14.2561532Z 2025-05-07T19:48:14.2561675Z 2025-05-07T19:48:14.2561681Z 2025-05-07T19:48:14.2561686Z 2025-05-07T19:48:14.2561960Z  2025-05-07T19:48:14.2562194Z 2025-05-07T19:48:14.2562198Z 2025-05-07T19:48:14.2562202Z 2025-05-07T19:48:14.2562205Z 2025-05-07T19:48:14.2562209Z 2025-05-07T19:48:14.2562213Z 2025-05-07T19:48:14.2562216Z 2025-05-07T19:48:14.2562220Z 2025-05-07T19:48:14.2562223Z 2025-05-07T19:48:14.2562237Z 2025-05-07T19:48:14.2562240Z 2025-05-07T19:48:14.2562244Z 2025-05-07T19:48:14.2562257Z 2025-05-07T19:48:14.2562523Z  2025-05-07T19:48:14.2562821Z 2025-05-07T19:48:14.2562827Z 2025-05-07T19:48:14.2562832Z 2025-05-07T19:48:14.2562837Z 2025-05-07T19:48:14.2562843Z 2025-05-07T19:48:14.2562848Z 2025-05-07T19:48:14.2562853Z 2025-05-07T19:48:14.2562858Z 2025-05-07T19:48:14.2562863Z 2025-05-07T19:48:14.2562877Z 2025-05-07T19:48:14.2562891Z 2025-05-07T19:48:14.2562896Z 2025-05-07T19:48:14.2562901Z 2025-05-07T19:48:14.2562906Z 2025-05-07T19:48:14.2563178Z  2025-05-07T19:48:14.2563487Z 2025-05-07T19:48:14.2563492Z 2025-05-07T19:48:14.2563497Z 2025-05-07T19:48:14.2563503Z 2025-05-07T19:48:14.2563508Z 2025-05-07T19:48:14.2563520Z 2025-05-07T19:48:14.2563526Z 2025-05-07T19:48:14.2563531Z 2025-05-07T19:48:14.2563577Z 2025-05-07T19:48:14.2563582Z 2025-05-07T19:48:14.2563587Z 2025-05-07T19:48:14.2563592Z 2025-05-07T19:48:14.2563598Z 2025-05-07T19:48:14.2563603Z 2025-05-07T19:48:14.2563608Z 2025-05-07T19:48:14.2563876Z  2025-05-07T19:48:14.2564184Z 2025-05-07T19:48:14.2564189Z 2025-05-07T19:48:14.2564195Z 2025-05-07T19:48:14.2564200Z 2025-05-07T19:48:14.2564205Z 2025-05-07T19:48:14.2564210Z 2025-05-07T19:48:14.2564222Z 2025-05-07T19:48:14.2564227Z 2025-05-07T19:48:14.2564232Z 2025-05-07T19:48:14.2564238Z 2025-05-07T19:48:14.2564243Z 2025-05-07T19:48:14.2564248Z 2025-05-07T19:48:14.2564253Z 2025-05-07T19:48:14.2564258Z 2025-05-07T19:48:14.2564264Z 2025-05-07T19:48:14.2564269Z 2025-05-07T19:48:14.2564553Z  2025-05-07T19:48:14.2564867Z 2025-05-07T19:48:14.2564873Z 2025-05-07T19:48:14.2564887Z 2025-05-07T19:48:14.2564892Z 2025-05-07T19:48:14.2564897Z 2025-05-07T19:48:14.2564903Z 2025-05-07T19:48:14.2564908Z 2025-05-07T19:48:14.2564913Z 2025-05-07T19:48:14.2564918Z 2025-05-07T19:48:14.2564923Z 2025-05-07T19:48:14.2564937Z 2025-05-07T19:48:14.2564943Z 2025-05-07T19:48:14.2564948Z 2025-05-07T19:48:14.2564953Z 2025-05-07T19:48:14.2564958Z 2025-05-07T19:48:14.2564963Z 2025-05-07T19:48:14.2564969Z 2025-05-07T19:48:14.2565256Z  2025-05-07T19:48:14.2565562Z 2025-05-07T19:48:14.2565567Z 2025-05-07T19:48:14.2565572Z 2025-05-07T19:48:14.2565586Z 2025-05-07T19:48:14.2565590Z 2025-05-07T19:48:14.2565596Z 2025-05-07T19:48:14.2565601Z 2025-05-07T19:48:14.2565606Z 2025-05-07T19:48:14.2565611Z 2025-05-07T19:48:14.2565617Z 2025-05-07T19:48:14.2565622Z 2025-05-07T19:48:14.2565627Z 2025-05-07T19:48:14.2565632Z 2025-05-07T19:48:14.2565637Z 2025-05-07T19:48:14.2565642Z 2025-05-07T19:48:14.2565764Z 2025-05-07T19:48:14.2565768Z 2025-05-07T19:48:14.2565772Z 2025-05-07T19:48:14.2565996Z  2025-05-07T19:48:14.2566328Z 2025-05-07T19:48:14.2566334Z 2025-05-07T19:48:14.2566469Z  2025-05-07T19:48:14.2566602Z 2025-05-07T19:48:14.2566607Z 2025-05-07T19:48:14.2566742Z  2025-05-07T19:48:14.2566880Z 2025-05-07T19:48:14.2566886Z 2025-05-07T19:48:14.2566891Z 2025-05-07T19:48:14.2567026Z  2025-05-07T19:48:14.2567278Z 2025-05-07T19:48:14.2567284Z 2025-05-07T19:48:14.2567289Z 2025-05-07T19:48:14.2567295Z 2025-05-07T19:48:14.2567434Z  2025-05-07T19:48:14.2567592Z 2025-05-07T19:48:14.2567598Z 2025-05-07T19:48:14.2567603Z 2025-05-07T19:48:14.2567608Z 2025-05-07T19:48:14.2567621Z 2025-05-07T19:48:14.2567762Z  2025-05-07T19:48:14.2567919Z 2025-05-07T19:48:14.2567925Z 2025-05-07T19:48:14.2567930Z 2025-05-07T19:48:14.2567935Z 2025-05-07T19:48:14.2567949Z 2025-05-07T19:48:14.2567955Z 2025-05-07T19:48:14.2568171Z  2025-05-07T19:48:14.2568336Z 2025-05-07T19:48:14.2568341Z 2025-05-07T19:48:14.2568346Z 2025-05-07T19:48:14.2568351Z 2025-05-07T19:48:14.2568357Z 2025-05-07T19:48:14.2568362Z 2025-05-07T19:48:14.2568367Z 2025-05-07T19:48:14.2568523Z  2025-05-07T19:48:14.2568700Z 2025-05-07T19:48:14.2568706Z 2025-05-07T19:48:14.2568711Z 2025-05-07T19:48:14.2568716Z 2025-05-07T19:48:14.2568721Z 2025-05-07T19:48:14.2568727Z 2025-05-07T19:48:14.2568739Z 2025-05-07T19:48:14.2568744Z 2025-05-07T19:48:14.2568910Z  2025-05-07T19:48:14.2569115Z 2025-05-07T19:48:14.2569120Z 2025-05-07T19:48:14.2569125Z 2025-05-07T19:48:14.2569130Z 2025-05-07T19:48:14.2569136Z 2025-05-07T19:48:14.2569141Z 2025-05-07T19:48:14.2569146Z 2025-05-07T19:48:14.2569151Z 2025-05-07T19:48:14.2569156Z 2025-05-07T19:48:14.2569328Z  2025-05-07T19:48:14.2569531Z 2025-05-07T19:48:14.2569543Z 2025-05-07T19:48:14.2569548Z 2025-05-07T19:48:14.2569554Z 2025-05-07T19:48:14.2569559Z 2025-05-07T19:48:14.2569564Z 2025-05-07T19:48:14.2569569Z 2025-05-07T19:48:14.2569574Z 2025-05-07T19:48:14.2569580Z 2025-05-07T19:48:14.2569585Z 2025-05-07T19:48:14.2569764Z  2025-05-07T19:48:14.2569982Z 2025-05-07T19:48:14.2569987Z 2025-05-07T19:48:14.2569992Z 2025-05-07T19:48:14.2569997Z 2025-05-07T19:48:14.2570002Z 2025-05-07T19:48:14.2570008Z 2025-05-07T19:48:14.2570013Z 2025-05-07T19:48:14.2570024Z 2025-05-07T19:48:14.2570030Z 2025-05-07T19:48:14.2570035Z 2025-05-07T19:48:14.2570040Z 2025-05-07T19:48:14.2570210Z  2025-05-07T19:48:14.2570434Z 2025-05-07T19:48:14.2570439Z 2025-05-07T19:48:14.2570444Z 2025-05-07T19:48:14.2570449Z 2025-05-07T19:48:14.2570454Z 2025-05-07T19:48:14.2570459Z 2025-05-07T19:48:14.2570465Z 2025-05-07T19:48:14.2570470Z 2025-05-07T19:48:14.2570475Z 2025-05-07T19:48:14.2570480Z 2025-05-07T19:48:14.2570491Z 2025-05-07T19:48:14.2570496Z 2025-05-07T19:48:14.2570700Z  2025-05-07T19:48:14.2570984Z 2025-05-07T19:48:14.2570990Z 2025-05-07T19:48:14.2570996Z 2025-05-07T19:48:14.2571001Z 2025-05-07T19:48:14.2571006Z 2025-05-07T19:48:14.2571011Z 2025-05-07T19:48:14.2571017Z 2025-05-07T19:48:14.2571022Z 2025-05-07T19:48:14.2571027Z 2025-05-07T19:48:14.2571032Z 2025-05-07T19:48:14.2571037Z 2025-05-07T19:48:14.2571042Z 2025-05-07T19:48:14.2571048Z 2025-05-07T19:48:14.2571250Z  2025-05-07T19:48:14.2571496Z 2025-05-07T19:48:14.2571501Z 2025-05-07T19:48:14.2571507Z 2025-05-07T19:48:14.2571512Z 2025-05-07T19:48:14.2571517Z 2025-05-07T19:48:14.2571522Z 2025-05-07T19:48:14.2571527Z 2025-05-07T19:48:14.2571533Z 2025-05-07T19:48:14.2571538Z 2025-05-07T19:48:14.2571543Z 2025-05-07T19:48:14.2571548Z 2025-05-07T19:48:14.2571553Z 2025-05-07T19:48:14.2571559Z 2025-05-07T19:48:14.2571564Z 2025-05-07T19:48:14.2571762Z  2025-05-07T19:48:14.2572153Z 2025-05-07T19:48:14.2572159Z 2025-05-07T19:48:14.2572164Z 2025-05-07T19:48:14.2572169Z 2025-05-07T19:48:14.2572174Z 2025-05-07T19:48:14.2572180Z 2025-05-07T19:48:14.2572185Z 2025-05-07T19:48:14.2572190Z 2025-05-07T19:48:14.2572195Z 2025-05-07T19:48:14.2572200Z 2025-05-07T19:48:14.2572214Z 2025-05-07T19:48:14.2572220Z 2025-05-07T19:48:14.2572225Z 2025-05-07T19:48:14.2572230Z 2025-05-07T19:48:14.2572236Z 2025-05-07T19:48:14.2572537Z  2025-05-07T19:48:14.2572801Z 2025-05-07T19:48:14.2572806Z 2025-05-07T19:48:14.2572812Z 2025-05-07T19:48:14.2572817Z 2025-05-07T19:48:14.2572822Z 2025-05-07T19:48:14.2572827Z 2025-05-07T19:48:14.2572842Z 2025-05-07T19:48:14.2572848Z 2025-05-07T19:48:14.2572853Z 2025-05-07T19:48:14.2572858Z 2025-05-07T19:48:14.2572863Z 2025-05-07T19:48:14.2572869Z 2025-05-07T19:48:14.2572874Z 2025-05-07T19:48:14.2572879Z 2025-05-07T19:48:14.2572884Z 2025-05-07T19:48:14.2572900Z 2025-05-07T19:48:14.2573139Z  2025-05-07T19:48:14.2573417Z 2025-05-07T19:48:14.2573422Z 2025-05-07T19:48:14.2573427Z 2025-05-07T19:48:14.2573432Z 2025-05-07T19:48:14.2573437Z 2025-05-07T19:48:14.2573443Z 2025-05-07T19:48:14.2573448Z 2025-05-07T19:48:14.2573453Z 2025-05-07T19:48:14.2573459Z 2025-05-07T19:48:14.2573464Z 2025-05-07T19:48:14.2573469Z 2025-05-07T19:48:14.2573474Z 2025-05-07T19:48:14.2573480Z 2025-05-07T19:48:14.2573485Z 2025-05-07T19:48:14.2573497Z 2025-05-07T19:48:14.2573502Z 2025-05-07T19:48:14.2573507Z 2025-05-07T19:48:14.2573726Z  2025-05-07T19:48:14.2573998Z 2025-05-07T19:48:14.2574003Z 2025-05-07T19:48:14.2574008Z 2025-05-07T19:48:14.2574014Z 2025-05-07T19:48:14.2574019Z 2025-05-07T19:48:14.2574024Z 2025-05-07T19:48:14.2574029Z 2025-05-07T19:48:14.2574035Z 2025-05-07T19:48:14.2574040Z 2025-05-07T19:48:14.2574045Z 2025-05-07T19:48:14.2574050Z 2025-05-07T19:48:14.2574062Z 2025-05-07T19:48:14.2574067Z 2025-05-07T19:48:14.2574072Z 2025-05-07T19:48:14.2574086Z 2025-05-07T19:48:14.2574091Z 2025-05-07T19:48:14.2574096Z 2025-05-07T19:48:14.2574102Z 2025-05-07T19:48:14.2574349Z  2025-05-07T19:48:14.2574625Z 2025-05-07T19:48:14.2574631Z 2025-05-07T19:48:14.2574764Z  2025-05-07T19:48:14.2574894Z 2025-05-07T19:48:14.2574899Z 2025-05-07T19:48:14.2575028Z  2025-05-07T19:48:14.2575167Z 2025-05-07T19:48:14.2575184Z 2025-05-07T19:48:14.2575197Z 2025-05-07T19:48:14.2575328Z  2025-05-07T19:48:14.2575465Z 2025-05-07T19:48:14.2575471Z 2025-05-07T19:48:14.2575476Z 2025-05-07T19:48:14.2575481Z 2025-05-07T19:48:14.2575627Z  2025-05-07T19:48:14.2575775Z 2025-05-07T19:48:14.2575781Z 2025-05-07T19:48:14.2575786Z 2025-05-07T19:48:14.2575791Z 2025-05-07T19:48:14.2575796Z 2025-05-07T19:48:14.2575974Z  2025-05-07T19:48:14.2576141Z 2025-05-07T19:48:14.2576147Z 2025-05-07T19:48:14.2576159Z 2025-05-07T19:48:14.2576164Z 2025-05-07T19:48:14.2576170Z 2025-05-07T19:48:14.2576175Z 2025-05-07T19:48:14.2576318Z  2025-05-07T19:48:14.2576492Z 2025-05-07T19:48:14.2576497Z 2025-05-07T19:48:14.2576502Z 2025-05-07T19:48:14.2576508Z 2025-05-07T19:48:14.2576513Z 2025-05-07T19:48:14.2576518Z 2025-05-07T19:48:14.2576523Z 2025-05-07T19:48:14.2576673Z  2025-05-07T19:48:14.2576850Z 2025-05-07T19:48:14.2576855Z 2025-05-07T19:48:14.2576860Z 2025-05-07T19:48:14.2576874Z 2025-05-07T19:48:14.2576887Z 2025-05-07T19:48:14.2576892Z 2025-05-07T19:48:14.2576898Z 2025-05-07T19:48:14.2576903Z 2025-05-07T19:48:14.2577056Z  2025-05-07T19:48:14.2577247Z 2025-05-07T19:48:14.2577252Z 2025-05-07T19:48:14.2577257Z 2025-05-07T19:48:14.2577262Z 2025-05-07T19:48:14.2577267Z 2025-05-07T19:48:14.2577272Z 2025-05-07T19:48:14.2577278Z 2025-05-07T19:48:14.2577291Z 2025-05-07T19:48:14.2577296Z 2025-05-07T19:48:14.2577452Z  2025-05-07T19:48:14.2577767Z 2025-05-07T19:48:14.2577773Z 2025-05-07T19:48:14.2577778Z 2025-05-07T19:48:14.2577783Z 2025-05-07T19:48:14.2577789Z 2025-05-07T19:48:14.2577794Z 2025-05-07T19:48:14.2577799Z 2025-05-07T19:48:14.2577805Z 2025-05-07T19:48:14.2577810Z 2025-05-07T19:48:14.2577825Z 2025-05-07T19:48:14.2578003Z  2025-05-07T19:48:14.2578215Z 2025-05-07T19:48:14.2578220Z 2025-05-07T19:48:14.2578225Z 2025-05-07T19:48:14.2578231Z 2025-05-07T19:48:14.2578236Z 2025-05-07T19:48:14.2578333Z 2025-05-07T19:48:14.2578339Z 2025-05-07T19:48:14.2578344Z 2025-05-07T19:48:14.2578349Z 2025-05-07T19:48:14.2578355Z 2025-05-07T19:48:14.2578368Z 2025-05-07T19:48:14.2578566Z  2025-05-07T19:48:14.2578796Z 2025-05-07T19:48:14.2578801Z 2025-05-07T19:48:14.2578807Z 2025-05-07T19:48:14.2578814Z 2025-05-07T19:48:14.2578825Z 2025-05-07T19:48:14.2578851Z 2025-05-07T19:48:14.2578857Z 2025-05-07T19:48:14.2578862Z 2025-05-07T19:48:14.2578876Z 2025-05-07T19:48:14.2578882Z 2025-05-07T19:48:14.2578887Z 2025-05-07T19:48:14.2578892Z 2025-05-07T19:48:14.2579079Z  2025-05-07T19:48:14.2579315Z 2025-05-07T19:48:14.2579320Z 2025-05-07T19:48:14.2579325Z 2025-05-07T19:48:14.2579331Z 2025-05-07T19:48:14.2579343Z 2025-05-07T19:48:14.2579348Z 2025-05-07T19:48:14.2579353Z 2025-05-07T19:48:14.2579359Z 2025-05-07T19:48:14.2579364Z 2025-05-07T19:48:14.2579369Z 2025-05-07T19:48:14.2579374Z 2025-05-07T19:48:14.2579380Z 2025-05-07T19:48:14.2579391Z 2025-05-07T19:48:14.2579569Z  2025-05-07T19:48:14.2579818Z 2025-05-07T19:48:14.2579823Z 2025-05-07T19:48:14.2579837Z 2025-05-07T19:48:14.2579842Z 2025-05-07T19:48:14.2579848Z 2025-05-07T19:48:14.2579853Z 2025-05-07T19:48:14.2579858Z 2025-05-07T19:48:14.2579863Z 2025-05-07T19:48:14.2579868Z 2025-05-07T19:48:14.2579873Z 2025-05-07T19:48:14.2579878Z 2025-05-07T19:48:14.2579883Z 2025-05-07T19:48:14.2579888Z 2025-05-07T19:48:14.2579900Z 2025-05-07T19:48:14.2580088Z  2025-05-07T19:48:14.2580348Z 2025-05-07T19:48:14.2580354Z 2025-05-07T19:48:14.2580359Z 2025-05-07T19:48:14.2580364Z 2025-05-07T19:48:14.2580370Z 2025-05-07T19:48:14.2580375Z 2025-05-07T19:48:14.2580380Z 2025-05-07T19:48:14.2580385Z 2025-05-07T19:48:14.2580391Z 2025-05-07T19:48:14.2580396Z 2025-05-07T19:48:14.2580401Z 2025-05-07T19:48:14.2580406Z 2025-05-07T19:48:14.2580412Z 2025-05-07T19:48:14.2580417Z 2025-05-07T19:48:14.2580428Z 2025-05-07T19:48:14.2580657Z  2025-05-07T19:48:14.2580915Z 2025-05-07T19:48:14.2580920Z 2025-05-07T19:48:14.2580925Z 2025-05-07T19:48:14.2580930Z 2025-05-07T19:48:14.2580936Z 2025-05-07T19:48:14.2580941Z 2025-05-07T19:48:14.2580946Z 2025-05-07T19:48:14.2580951Z 2025-05-07T19:48:14.2580956Z 2025-05-07T19:48:14.2580961Z 2025-05-07T19:48:14.2580967Z 2025-05-07T19:48:14.2580972Z 2025-05-07T19:48:14.2580977Z 2025-05-07T19:48:14.2580989Z 2025-05-07T19:48:14.2581004Z 2025-05-07T19:48:14.2581009Z 2025-05-07T19:48:14.2581189Z  2025-05-07T19:48:14.2581397Z 2025-05-07T19:48:14.2581402Z 2025-05-07T19:48:14.2581407Z 2025-05-07T19:48:14.2581412Z 2025-05-07T19:48:14.2581417Z 2025-05-07T19:48:14.2581423Z 2025-05-07T19:48:14.2581428Z 2025-05-07T19:48:14.2581433Z 2025-05-07T19:48:14.2581447Z 2025-05-07T19:48:14.2581452Z 2025-05-07T19:48:14.2581457Z 2025-05-07T19:48:14.2581462Z 2025-05-07T19:48:14.2581476Z 2025-05-07T19:48:14.2581481Z 2025-05-07T19:48:14.2581486Z 2025-05-07T19:48:14.2581491Z 2025-05-07T19:48:14.2581496Z 2025-05-07T19:48:14.2581718Z  2025-05-07T19:48:14.2581989Z 2025-05-07T19:48:14.2581994Z 2025-05-07T19:48:14.2582008Z 2025-05-07T19:48:14.2582014Z 2025-05-07T19:48:14.2582019Z 2025-05-07T19:48:14.2582024Z 2025-05-07T19:48:14.2582029Z 2025-05-07T19:48:14.2582035Z 2025-05-07T19:48:14.2582040Z 2025-05-07T19:48:14.2582187Z 2025-05-07T19:48:14.2582193Z 2025-05-07T19:48:14.2582198Z 2025-05-07T19:48:14.2582203Z 2025-05-07T19:48:14.2582209Z 2025-05-07T19:48:14.2582214Z 2025-05-07T19:48:14.2582219Z 2025-05-07T19:48:14.2582224Z 2025-05-07T19:48:14.2582230Z 2025-05-07T19:48:14.2582471Z  2025-05-07T19:48:14.2582754Z 2025-05-07T19:48:14.2582759Z 2025-05-07T19:48:14.2582905Z  2025-05-07T19:48:14.2583039Z 2025-05-07T19:48:14.2583055Z 2025-05-07T19:48:14.2583282Z  2025-05-07T19:48:14.2583423Z 2025-05-07T19:48:14.2583428Z 2025-05-07T19:48:14.2583433Z 2025-05-07T19:48:14.2583568Z  2025-05-07T19:48:14.2583710Z 2025-05-07T19:48:14.2583715Z 2025-05-07T19:48:14.2583721Z 2025-05-07T19:48:14.2583727Z 2025-05-07T19:48:14.2583862Z  2025-05-07T19:48:14.2584008Z 2025-05-07T19:48:14.2584013Z 2025-05-07T19:48:14.2584027Z 2025-05-07T19:48:14.2584033Z 2025-05-07T19:48:14.2584038Z 2025-05-07T19:48:14.2584190Z  2025-05-07T19:48:14.2584357Z 2025-05-07T19:48:14.2584362Z 2025-05-07T19:48:14.2584367Z 2025-05-07T19:48:14.2584372Z 2025-05-07T19:48:14.2584379Z 2025-05-07T19:48:14.2584384Z 2025-05-07T19:48:14.2584540Z  2025-05-07T19:48:14.2584708Z 2025-05-07T19:48:14.2584713Z 2025-05-07T19:48:14.2584719Z 2025-05-07T19:48:14.2584724Z 2025-05-07T19:48:14.2584729Z 2025-05-07T19:48:14.2584734Z 2025-05-07T19:48:14.2584739Z 2025-05-07T19:48:14.2585198Z  2025-05-07T19:48:14.2585389Z 2025-05-07T19:48:14.2585403Z 2025-05-07T19:48:14.2585408Z 2025-05-07T19:48:14.2585413Z 2025-05-07T19:48:14.2585419Z 2025-05-07T19:48:14.2585424Z 2025-05-07T19:48:14.2585429Z 2025-05-07T19:48:14.2585434Z 2025-05-07T19:48:14.2585628Z  2025-05-07T19:48:14.2585816Z 2025-05-07T19:48:14.2585821Z 2025-05-07T19:48:14.2585827Z 2025-05-07T19:48:14.2585832Z 2025-05-07T19:48:14.2585837Z 2025-05-07T19:48:14.2585842Z 2025-05-07T19:48:14.2585848Z 2025-05-07T19:48:14.2585861Z 2025-05-07T19:48:14.2585866Z 2025-05-07T19:48:14.2586028Z  2025-05-07T19:48:14.2586236Z 2025-05-07T19:48:14.2586242Z 2025-05-07T19:48:14.2586247Z 2025-05-07T19:48:14.2586252Z 2025-05-07T19:48:14.2586257Z 2025-05-07T19:48:14.2586263Z 2025-05-07T19:48:14.2586268Z 2025-05-07T19:48:14.2586273Z 2025-05-07T19:48:14.2586278Z 2025-05-07T19:48:14.2586283Z 2025-05-07T19:48:14.2586476Z  2025-05-07T19:48:14.2586684Z 2025-05-07T19:48:14.2586690Z 2025-05-07T19:48:14.2586700Z 2025-05-07T19:48:14.2586706Z 2025-05-07T19:48:14.2586711Z 2025-05-07T19:48:14.2586716Z 2025-05-07T19:48:14.2586722Z 2025-05-07T19:48:14.2586727Z 2025-05-07T19:48:14.2586734Z 2025-05-07T19:48:14.2586739Z 2025-05-07T19:48:14.2586744Z 2025-05-07T19:48:14.2586944Z  2025-05-07T19:48:14.2587166Z 2025-05-07T19:48:14.2587171Z 2025-05-07T19:48:14.2587176Z 2025-05-07T19:48:14.2587182Z 2025-05-07T19:48:14.2587187Z 2025-05-07T19:48:14.2587192Z 2025-05-07T19:48:14.2587203Z 2025-05-07T19:48:14.2587208Z 2025-05-07T19:48:14.2587214Z 2025-05-07T19:48:14.2587219Z 2025-05-07T19:48:14.2587224Z 2025-05-07T19:48:14.2587229Z 2025-05-07T19:48:14.2587409Z  2025-05-07T19:48:14.2587645Z 2025-05-07T19:48:14.2587651Z 2025-05-07T19:48:14.2587656Z 2025-05-07T19:48:14.2587661Z 2025-05-07T19:48:14.2587667Z 2025-05-07T19:48:14.2587672Z 2025-05-07T19:48:14.2587677Z 2025-05-07T19:48:14.2587682Z 2025-05-07T19:48:14.2587688Z 2025-05-07T19:48:14.2587698Z 2025-05-07T19:48:14.2587703Z 2025-05-07T19:48:14.2587709Z 2025-05-07T19:48:14.2587714Z 2025-05-07T19:48:14.2587926Z  2025-05-07T19:48:14.2588164Z 2025-05-07T19:48:14.2588169Z 2025-05-07T19:48:14.2588174Z 2025-05-07T19:48:14.2588179Z 2025-05-07T19:48:14.2588185Z 2025-05-07T19:48:14.2588190Z 2025-05-07T19:48:14.2588195Z 2025-05-07T19:48:14.2588209Z 2025-05-07T19:48:14.2588214Z 2025-05-07T19:48:14.2588220Z 2025-05-07T19:48:14.2588737Z 2025-05-07T19:48:14.2588743Z 2025-05-07T19:48:14.2588748Z 2025-05-07T19:48:14.2588753Z 2025-05-07T19:48:14.2588957Z  2025-05-07T19:48:14.2589210Z 2025-05-07T19:48:14.2589216Z 2025-05-07T19:48:14.2589221Z 2025-05-07T19:48:14.2589236Z 2025-05-07T19:48:14.2589241Z 2025-05-07T19:48:14.2589247Z 2025-05-07T19:48:14.2589252Z 2025-05-07T19:48:14.2589257Z 2025-05-07T19:48:14.2589262Z 2025-05-07T19:48:14.2589268Z 2025-05-07T19:48:14.2589273Z 2025-05-07T19:48:14.2589278Z 2025-05-07T19:48:14.2589409Z 2025-05-07T19:48:14.2589416Z 2025-05-07T19:48:14.2589421Z 2025-05-07T19:48:14.2589620Z  2025-05-07T19:48:14.2589888Z 2025-05-07T19:48:14.2589893Z 2025-05-07T19:48:14.2589899Z 2025-05-07T19:48:14.2589904Z 2025-05-07T19:48:14.2589910Z 2025-05-07T19:48:14.2589915Z 2025-05-07T19:48:14.2589920Z 2025-05-07T19:48:14.2589926Z 2025-05-07T19:48:14.2589931Z 2025-05-07T19:48:14.2589936Z 2025-05-07T19:48:14.2589949Z 2025-05-07T19:48:14.2589955Z 2025-05-07T19:48:14.2589961Z 2025-05-07T19:48:14.2589966Z 2025-05-07T19:48:14.2589971Z 2025-05-07T19:48:14.2589976Z 2025-05-07T19:48:14.2590174Z  2025-05-07T19:48:14.2590447Z 2025-05-07T19:48:14.2590453Z 2025-05-07T19:48:14.2590458Z 2025-05-07T19:48:14.2590463Z 2025-05-07T19:48:14.2590468Z 2025-05-07T19:48:14.2590474Z 2025-05-07T19:48:14.2590479Z 2025-05-07T19:48:14.2590484Z 2025-05-07T19:48:14.2590489Z 2025-05-07T19:48:14.2590525Z 2025-05-07T19:48:14.2590537Z 2025-05-07T19:48:14.2590542Z 2025-05-07T19:48:14.2590548Z 2025-05-07T19:48:14.2590560Z 2025-05-07T19:48:14.2590565Z 2025-05-07T19:48:14.2590571Z 2025-05-07T19:48:14.2590576Z 2025-05-07T19:48:14.2590779Z  2025-05-07T19:48:14.2591050Z 2025-05-07T19:48:14.2591055Z 2025-05-07T19:48:14.2591060Z 2025-05-07T19:48:14.2591065Z 2025-05-07T19:48:14.2591071Z 2025-05-07T19:48:14.2591076Z 2025-05-07T19:48:14.2591090Z 2025-05-07T19:48:14.2591104Z 2025-05-07T19:48:14.2591109Z 2025-05-07T19:48:14.2591114Z 2025-05-07T19:48:14.2591119Z 2025-05-07T19:48:14.2591125Z 2025-05-07T19:48:14.2591129Z 2025-05-07T19:48:14.2591134Z 2025-05-07T19:48:14.2591139Z 2025-05-07T19:48:14.2591144Z 2025-05-07T19:48:14.2591149Z 2025-05-07T19:48:14.2591154Z 2025-05-07T19:48:14.2591365Z  2025-05-07T19:48:14.2591656Z 2025-05-07T19:48:14.2591662Z 2025-05-07T19:48:14.2591788Z  2025-05-07T19:48:14.2592148Z 2025-05-07T19:48:14.2592154Z 2025-05-07T19:48:14.2592291Z  2025-05-07T19:48:14.2592438Z 2025-05-07T19:48:14.2592443Z 2025-05-07T19:48:14.2592449Z 2025-05-07T19:48:14.2592586Z  2025-05-07T19:48:14.2592723Z 2025-05-07T19:48:14.2592728Z 2025-05-07T19:48:14.2592740Z 2025-05-07T19:48:14.2592746Z 2025-05-07T19:48:14.2592882Z  2025-05-07T19:48:14.2593031Z 2025-05-07T19:48:14.2593037Z 2025-05-07T19:48:14.2593042Z 2025-05-07T19:48:14.2593046Z 2025-05-07T19:48:14.2593059Z 2025-05-07T19:48:14.2593206Z  2025-05-07T19:48:14.2593364Z 2025-05-07T19:48:14.2593370Z 2025-05-07T19:48:14.2593375Z 2025-05-07T19:48:14.2593380Z 2025-05-07T19:48:14.2593385Z 2025-05-07T19:48:14.2593390Z 2025-05-07T19:48:14.2593596Z  2025-05-07T19:48:14.2593756Z 2025-05-07T19:48:14.2593761Z 2025-05-07T19:48:14.2593766Z 2025-05-07T19:48:14.2593772Z 2025-05-07T19:48:14.2593777Z 2025-05-07T19:48:14.2593782Z 2025-05-07T19:48:14.2593787Z 2025-05-07T19:48:14.2593953Z  2025-05-07T19:48:14.2594139Z 2025-05-07T19:48:14.2594145Z 2025-05-07T19:48:14.2594150Z 2025-05-07T19:48:14.2594155Z 2025-05-07T19:48:14.2594160Z 2025-05-07T19:48:14.2594165Z 2025-05-07T19:48:14.2594170Z 2025-05-07T19:48:14.2594175Z 2025-05-07T19:48:14.2594329Z  2025-05-07T19:48:14.2594525Z 2025-05-07T19:48:14.2594531Z 2025-05-07T19:48:14.2594536Z 2025-05-07T19:48:14.2594541Z 2025-05-07T19:48:14.2594547Z 2025-05-07T19:48:14.2594663Z 2025-05-07T19:48:14.2594669Z 2025-05-07T19:48:14.2594674Z 2025-05-07T19:48:14.2594680Z 2025-05-07T19:48:14.2594848Z  2025-05-07T19:48:14.2595058Z 2025-05-07T19:48:14.2595063Z 2025-05-07T19:48:14.2595068Z 2025-05-07T19:48:14.2595074Z 2025-05-07T19:48:14.2595079Z 2025-05-07T19:48:14.2595084Z 2025-05-07T19:48:14.2595089Z 2025-05-07T19:48:14.2595094Z 2025-05-07T19:48:14.2595100Z 2025-05-07T19:48:14.2595105Z 2025-05-07T19:48:14.2595269Z  2025-05-07T19:48:14.2595573Z 2025-05-07T19:48:14.2595579Z 2025-05-07T19:48:14.2595584Z 2025-05-07T19:48:14.2595589Z 2025-05-07T19:48:14.2595594Z 2025-05-07T19:48:14.2595599Z 2025-05-07T19:48:14.2595604Z 2025-05-07T19:48:14.2595610Z 2025-05-07T19:48:14.2595615Z 2025-05-07T19:48:14.2595620Z 2025-05-07T19:48:14.2595625Z 2025-05-07T19:48:14.2595798Z  2025-05-07T19:48:14.2596031Z 2025-05-07T19:48:14.2596036Z 2025-05-07T19:48:14.2596042Z 2025-05-07T19:48:14.2596054Z 2025-05-07T19:48:14.2596059Z 2025-05-07T19:48:14.2596064Z 2025-05-07T19:48:14.2596069Z 2025-05-07T19:48:14.2596074Z 2025-05-07T19:48:14.2596080Z 2025-05-07T19:48:14.2596085Z 2025-05-07T19:48:14.2596090Z 2025-05-07T19:48:14.2596095Z 2025-05-07T19:48:14.2596266Z  2025-05-07T19:48:14.2596504Z 2025-05-07T19:48:14.2596509Z 2025-05-07T19:48:14.2596514Z 2025-05-07T19:48:14.2596519Z 2025-05-07T19:48:14.2596524Z 2025-05-07T19:48:14.2596530Z 2025-05-07T19:48:14.2596535Z 2025-05-07T19:48:14.2596546Z 2025-05-07T19:48:14.2596551Z 2025-05-07T19:48:14.2596556Z 2025-05-07T19:48:14.2596561Z 2025-05-07T19:48:14.2596566Z 2025-05-07T19:48:14.2596572Z 2025-05-07T19:48:14.2596756Z  2025-05-07T19:48:14.2597006Z 2025-05-07T19:48:14.2597011Z 2025-05-07T19:48:14.2597016Z 2025-05-07T19:48:14.2597021Z 2025-05-07T19:48:14.2597025Z 2025-05-07T19:48:14.2597031Z 2025-05-07T19:48:14.2597036Z 2025-05-07T19:48:14.2597041Z 2025-05-07T19:48:14.2597053Z 2025-05-07T19:48:14.2597059Z 2025-05-07T19:48:14.2597064Z 2025-05-07T19:48:14.2597069Z 2025-05-07T19:48:14.2597074Z 2025-05-07T19:48:14.2597080Z 2025-05-07T19:48:14.2597272Z  2025-05-07T19:48:14.2597521Z 2025-05-07T19:48:14.2597527Z 2025-05-07T19:48:14.2597532Z 2025-05-07T19:48:14.2597537Z 2025-05-07T19:48:14.2597542Z 2025-05-07T19:48:14.2597547Z 2025-05-07T19:48:14.2597552Z 2025-05-07T19:48:14.2597558Z 2025-05-07T19:48:14.2597563Z 2025-05-07T19:48:14.2597574Z 2025-05-07T19:48:14.2597579Z 2025-05-07T19:48:14.2597584Z 2025-05-07T19:48:14.2597589Z 2025-05-07T19:48:14.2597595Z 2025-05-07T19:48:14.2597600Z 2025-05-07T19:48:14.2597798Z  2025-05-07T19:48:14.2598028Z 2025-05-07T19:48:14.2598032Z 2025-05-07T19:48:14.2598036Z 2025-05-07T19:48:14.2598039Z 2025-05-07T19:48:14.2598043Z 2025-05-07T19:48:14.2598047Z 2025-05-07T19:48:14.2598050Z 2025-05-07T19:48:14.2598054Z 2025-05-07T19:48:14.2598100Z 2025-05-07T19:48:14.2598104Z 2025-05-07T19:48:14.2598107Z 2025-05-07T19:48:14.2598111Z 2025-05-07T19:48:14.2598114Z 2025-05-07T19:48:14.2598118Z 2025-05-07T19:48:14.2598122Z 2025-05-07T19:48:14.2598125Z 2025-05-07T19:48:14.2598277Z  2025-05-07T19:48:14.2598476Z 2025-05-07T19:48:14.2598480Z 2025-05-07T19:48:14.2598483Z 2025-05-07T19:48:14.2598487Z 2025-05-07T19:48:14.2598491Z 2025-05-07T19:48:14.2598494Z 2025-05-07T19:48:14.2598498Z 2025-05-07T19:48:14.2598506Z 2025-05-07T19:48:14.2598509Z 2025-05-07T19:48:14.2598513Z 2025-05-07T19:48:14.2598517Z 2025-05-07T19:48:14.2598520Z 2025-05-07T19:48:14.2598524Z 2025-05-07T19:48:14.2598528Z 2025-05-07T19:48:14.2598531Z 2025-05-07T19:48:14.2598535Z 2025-05-07T19:48:14.2598539Z 2025-05-07T19:48:14.2598693Z  2025-05-07T19:48:14.2598892Z 2025-05-07T19:48:14.2598896Z 2025-05-07T19:48:14.2598899Z 2025-05-07T19:48:14.2598903Z 2025-05-07T19:48:14.2599050Z 2025-05-07T19:48:14.2599053Z 2025-05-07T19:48:14.2599057Z 2025-05-07T19:48:14.2599061Z 2025-05-07T19:48:14.2599064Z 2025-05-07T19:48:14.2599068Z 2025-05-07T19:48:14.2599071Z 2025-05-07T19:48:14.2599075Z 2025-05-07T19:48:14.2599079Z 2025-05-07T19:48:14.2599082Z 2025-05-07T19:48:14.2599086Z 2025-05-07T19:48:14.2599090Z 2025-05-07T19:48:14.2599101Z 2025-05-07T19:48:14.2599105Z 2025-05-07T19:48:14.2599265Z  2025-05-07T19:48:14.2599466Z 2025-05-07T19:48:14.2599542Z 2025-05-07T19:48:14.2599636Z  2025-05-07T19:48:14.2599743Z 2025-05-07T19:48:14.2599747Z 2025-05-07T19:48:14.2599841Z  2025-05-07T19:48:14.2599940Z 2025-05-07T19:48:14.2599944Z 2025-05-07T19:48:14.2599948Z 2025-05-07T19:48:14.2600050Z  2025-05-07T19:48:14.2600152Z 2025-05-07T19:48:14.2600156Z 2025-05-07T19:48:14.2600160Z 2025-05-07T19:48:14.2600163Z 2025-05-07T19:48:14.2600260Z  2025-05-07T19:48:14.2600378Z 2025-05-07T19:48:14.2600388Z 2025-05-07T19:48:14.2600391Z 2025-05-07T19:48:14.2600395Z 2025-05-07T19:48:14.2600399Z 2025-05-07T19:48:14.2600500Z  2025-05-07T19:48:14.2600616Z 2025-05-07T19:48:14.2600626Z 2025-05-07T19:48:14.2600629Z 2025-05-07T19:48:14.2600633Z 2025-05-07T19:48:14.2600637Z 2025-05-07T19:48:14.2600640Z 2025-05-07T19:48:14.2600744Z  2025-05-07T19:48:14.2600865Z 2025-05-07T19:48:14.2600869Z 2025-05-07T19:48:14.2600873Z 2025-05-07T19:48:14.2600876Z 2025-05-07T19:48:14.2600880Z 2025-05-07T19:48:14.2600888Z 2025-05-07T19:48:14.2600898Z 2025-05-07T19:48:14.2601004Z  2025-05-07T19:48:14.2601137Z 2025-05-07T19:48:14.2601141Z 2025-05-07T19:48:14.2601144Z 2025-05-07T19:48:14.2601148Z 2025-05-07T19:48:14.2601152Z 2025-05-07T19:48:14.2601155Z 2025-05-07T19:48:14.2601159Z 2025-05-07T19:48:14.2601163Z 2025-05-07T19:48:14.2601287Z  2025-05-07T19:48:14.2601436Z 2025-05-07T19:48:14.2601440Z 2025-05-07T19:48:14.2601444Z 2025-05-07T19:48:14.2601452Z 2025-05-07T19:48:14.2601456Z 2025-05-07T19:48:14.2601459Z 2025-05-07T19:48:14.2601463Z 2025-05-07T19:48:14.2601467Z 2025-05-07T19:48:14.2601470Z 2025-05-07T19:48:14.2603185Z  2025-05-07T19:48:14.6854924Z Preparing transaction: ...working... done 2025-05-07T19:48:18.3989865Z Verifying transaction: ...working... done 2025-05-07T19:48:24.3380599Z Executing transaction: ...working... done 2025-05-07T19:48:27.2137153Z [INSTALL] Adding symlink librhash.so.0, which is needed by CMake ... 2025-05-07T19:48:28.2009241Z + ln -s /__w/_temp/conda_environment_14891846312/lib/librhash.so /__w/_temp/conda_environment_14891846312/lib/librhash.so.0 2025-05-07T19:48:28.2009728Z 2025-05-07T19:48:28.2041202Z 2025-05-07T19:48:28.2082047Z [EXEC] [ATTEMPT 0/3] + conda run -p /__w/_temp/conda_environment_14891846312 pip install build 2025-05-07T19:48:29.9297024Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2025-05-07T19:48:29.9298431Z Collecting build 2025-05-07T19:48:29.9298763Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T19:48:29.9299492Z Requirement already satisfied: packaging>=19.1 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from build) (25.0) 2025-05-07T19:48:29.9300132Z Collecting pyproject_hooks (from build) 2025-05-07T19:48:29.9300546Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T19:48:29.9300993Z Collecting importlib-metadata>=4.6 (from build) 2025-05-07T19:48:29.9301450Z Downloading importlib_metadata-8.7.0-py3-none-any.whl.metadata (4.8 kB) 2025-05-07T19:48:29.9302180Z Requirement already satisfied: tomli>=1.1.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from build) (2.2.1) 2025-05-07T19:48:29.9303294Z Collecting zipp>=3.20 (from importlib-metadata>=4.6->build) 2025-05-07T19:48:29.9303732Z Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB) 2025-05-07T19:48:29.9304163Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T19:48:29.9304587Z Downloading importlib_metadata-8.7.0-py3-none-any.whl (27 kB) 2025-05-07T19:48:29.9304999Z Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB) 2025-05-07T19:48:29.9305563Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T19:48:29.9306070Z Installing collected packages: zipp, pyproject_hooks, importlib-metadata, build 2025-05-07T19:48:29.9306422Z 2025-05-07T19:48:29.9306436Z 2025-05-07T19:48:29.9306757Z Successfully installed build-1.2.2.post1 importlib-metadata-8.7.0 pyproject_hooks-1.2.0 zipp-3.21.0 2025-05-07T19:48:29.9307176Z 2025-05-07T19:48:30.9133367Z /__w/_temp/conda_environment_14891846312/bin/make 2025-05-07T19:48:30.9133653Z 2025-05-07T19:48:30.9558425Z [CHECK] Binary make found in PATH 2025-05-07T19:48:31.9003571Z /__w/_temp/conda_environment_14891846312/bin/cmake 2025-05-07T19:48:31.9003927Z 2025-05-07T19:48:31.9428026Z [CHECK] Binary cmake found in PATH 2025-05-07T19:48:32.8876190Z /__w/_temp/conda_environment_14891846312/bin/ninja 2025-05-07T19:48:32.8876466Z 2025-05-07T19:48:32.9304862Z [CHECK] Binary ninja found in PATH 2025-05-07T19:48:33.9498837Z [CHECK] Python (sub-)package 'click' found ... 2025-05-07T19:48:35.0972948Z [CHECK] Python (sub-)package 'hypothesis' found ... 2025-05-07T19:48:36.1412624Z [CHECK] Python (sub-)package 'jinja2' found ... 2025-05-07T19:48:37.2756767Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T19:48:38.2730839Z [CHECK] Python (sub-)package 'wheel' found ... 2025-05-07T19:48:38.2734797Z [INSTALL] Successfully installed all the build tools 2025-05-07T19:48:38.2767874Z [NOVA] Time taken to install Build Tools: 61 seconds 2025-05-07T19:48:38.2768350Z ################################################################################ 2025-05-07T19:48:38.2768962Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T19:48:38.2769349Z # 2025-05-07T19:48:38.2798972Z # [2025-05-07T19:48:38.279Z] + collect_pytorch_env_info /__w/_temp/conda_environment_14891846312 2025-05-07T19:48:38.2799615Z ################################################################################ 2025-05-07T19:48:38.2799919Z 2025-05-07T19:48:38.2837545Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T19:48:38.4074086Z [CHECK] Network does not appear to be blocked. 2025-05-07T19:48:38.4085064Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T19:48:38.4085895Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T19:48:38.4086354Z 2025-05-07T19:48:38.5351196Z 2025-05-07T19:48:38.5351772Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T19:48:38.5391626Z [EXEC] [ATTEMPT 0/3] + conda run -p /__w/_temp/conda_environment_14891846312 python collect_env.py 2025-05-07T19:48:43.4153043Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/_subclasses/functional_tensor.py:279: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) 2025-05-07T19:48:43.4154187Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-05-07T19:48:43.4154447Z 2025-05-07T19:48:43.4154616Z Collecting environment information... 2025-05-07T19:48:43.4154922Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T19:48:43.4155197Z Is debug build: False 2025-05-07T19:48:43.4155437Z CUDA used to build PyTorch: 12.8 2025-05-07T19:48:43.4155701Z ROCM used to build PyTorch: N/A 2025-05-07T19:48:43.4155872Z 2025-05-07T19:48:43.4155994Z OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64) 2025-05-07T19:48:43.4156330Z GCC version: (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9) 2025-05-07T19:48:43.4156962Z Clang version: Could not collect 2025-05-07T19:48:43.4157222Z CMake version: version 4.0.2 2025-05-07T19:48:43.4157472Z Libc version: glibc-2.28 2025-05-07T19:48:43.4157622Z 2025-05-07T19:48:43.4157920Z Python version: 3.9.22 | packaged by conda-forge | (main, Apr 14 2025, 23:35:59) [GCC 13.3.0] (64-bit runtime) 2025-05-07T19:48:43.4158513Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.28 2025-05-07T19:48:43.4158914Z Is CUDA available: True 2025-05-07T19:48:43.4159301Z CUDA runtime version: 12.8.61 2025-05-07T19:48:43.4159567Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T19:48:43.4159860Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T19:48:43.4160182Z Nvidia driver version: 570.133.07 2025-05-07T19:48:43.4160449Z cuDNN version: Could not collect 2025-05-07T19:48:43.4160709Z HIP runtime version: N/A 2025-05-07T19:48:43.4160953Z MIOpen runtime version: N/A 2025-05-07T19:48:43.4161198Z Is XNNPACK available: True 2025-05-07T19:48:43.4161355Z 2025-05-07T19:48:43.4161436Z CPU: 2025-05-07T19:48:43.4161618Z Architecture: x86_64 2025-05-07T19:48:43.4161873Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T19:48:43.4162139Z Byte Order: Little Endian 2025-05-07T19:48:43.4162394Z CPU(s): 16 2025-05-07T19:48:43.4162613Z On-line CPU(s) list: 0-15 2025-05-07T19:48:43.4162850Z Thread(s) per core: 2 2025-05-07T19:48:43.4163070Z Core(s) per socket: 8 2025-05-07T19:48:43.4163288Z Socket(s): 1 2025-05-07T19:48:43.4163501Z NUMA node(s): 1 2025-05-07T19:48:43.4163729Z Vendor ID: AuthenticAMD 2025-05-07T19:48:43.4163981Z CPU family: 23 2025-05-07T19:48:43.4164193Z Model: 49 2025-05-07T19:48:43.4164425Z Model name: AMD EPYC 7R32 2025-05-07T19:48:43.4164674Z Stepping: 0 2025-05-07T19:48:43.4164896Z CPU MHz: 3300.537 2025-05-07T19:48:43.4165131Z BogoMIPS: 5599.99 2025-05-07T19:48:43.4165376Z Hypervisor vendor: KVM 2025-05-07T19:48:43.4165614Z Virtualization type: full 2025-05-07T19:48:43.4165864Z L1d cache: 32K 2025-05-07T19:48:43.4166081Z L1i cache: 32K 2025-05-07T19:48:43.4166302Z L2 cache: 512K 2025-05-07T19:48:43.4166525Z L3 cache: 16384K 2025-05-07T19:48:43.4166755Z NUMA node0 CPU(s): 0-15 2025-05-07T19:48:43.4168787Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T19:48:43.4170740Z 2025-05-07T19:48:43.4170834Z Versions of relevant libraries: 2025-05-07T19:48:43.4171107Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T19:48:43.4171402Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T19:48:43.4171701Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T19:48:43.4171998Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T19:48:43.4172300Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T19:48:43.4172570Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T19:48:43.4172846Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T19:48:43.4173126Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T19:48:43.4173426Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T19:48:43.4173712Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T19:48:43.4173997Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T19:48:43.4174266Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T19:48:43.4174552Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T19:48:43.4174827Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T19:48:43.4175110Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T19:48:43.4175478Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T19:48:43.4176071Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T19:48:43.4176562Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T19:48:43.4177044Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T19:48:43.4177549Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T19:48:43.4178090Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T19:48:43.4178549Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T19:48:43.4179018Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T19:48:43.4179489Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T19:48:43.4179971Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T19:48:43.4188351Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T19:48:43.4188858Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T19:48:43.4189343Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T19:48:43.4189820Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T19:48:43.4190270Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T19:48:43.4190541Z 2025-05-07T19:48:43.4622659Z [NOVA] Time taken to collect PyTorch environment information: 5 seconds 2025-05-07T19:48:43.4623117Z [NOVA] Setting the FBGEMM build target: genai ... 2025-05-07T19:48:43.4633274Z [INSTALL] Set environment variables LD_LIBRARY_PATH ... 2025-05-07T19:48:43.4634564Z + conda env config vars set -p /__w/_temp/conda_environment_14891846312 LD_LIBRARY_PATH=/usr/local/lib:/usr/local/cuda-12.8/lib64:/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: CUDNN_INCLUDE_DIR=/usr/local/cuda-12.8/include CUDNN_LIBRARY=/usr/local/cuda-12.8/lib64 2025-05-07T19:48:43.4635651Z 2025-05-07T19:48:43.9407450Z To make your changes take effect please reactivate your environment 2025-05-07T19:48:43.9837411Z 2025-05-07T19:48:43.9837611Z [NOVA] -------- Finding libcuda.so ----------- 2025-05-07T19:48:44.0077789Z + ln /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libcuda.so -s /usr/local/lib/libcuda.so.1 2025-05-07T19:48:44.0078200Z 2025-05-07T19:48:44.0109013Z 2025-05-07T19:48:44.0109229Z [NOVA] -------- Finding NVML_LIB_PATH ----------- 2025-05-07T19:48:44.0318373Z [NOVA] looking in /usr/local/cuda-12.8 2025-05-07T19:48:44.0318848Z [NOVA] NVML_LIB_PATH = /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:48:44.0319306Z [NOVA] ------------------------------------------ 2025-05-07T19:48:44.0348796Z [NOVA] Time taken to find NVML_LIB_PATH: 1 seconds 2025-05-07T19:48:44.0349491Z [NOVA] Setting the FBGEMM build variant: cuda ... 2025-05-07T19:48:44.0350503Z ################################################################################ 2025-05-07T19:48:44.0350944Z # Prepare FBGEMM-GPU Build 2025-05-07T19:48:44.0351221Z # 2025-05-07T19:48:44.0381157Z # [2025-05-07T19:48:44.037Z] + prepare_fbgemm_gpu_build /__w/_temp/conda_environment_14891846312 2025-05-07T19:48:44.0381808Z ################################################################################ 2025-05-07T19:48:44.0382021Z 2025-05-07T19:48:44.0411837Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T19:48:44.1663613Z [CHECK] Network does not appear to be blocked. 2025-05-07T19:48:44.1688613Z [BUILD] Running git submodules update ... 2025-05-07T19:48:44.1724488Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T19:48:44.2246482Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T19:48:44.2247062Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T19:48:44.2247485Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T19:48:44.2248117Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T19:48:44.2248497Z Synchronizing submodule url for '../external/googletest' 2025-05-07T19:48:44.2248910Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T19:48:44.2249302Z Synchronizing submodule url for '../external/json' 2025-05-07T19:48:44.2298206Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T19:48:44.3042150Z [BUILD] Installing other build dependencies ... 2025-05-07T19:48:44.3079236Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m pip install -r requirements.txt 2025-05-07T19:48:45.2112087Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:45.2112624Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:45.6143285Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T19:48:45.6267617Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T19:48:45.6362786Z Requirement already satisfied: build in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 14)) (1.2.2.post1) 2025-05-07T19:48:45.7897332Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T19:48:45.7952350Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T19:48:45.8457476Z Requirement already satisfied: click in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 16)) (8.1.8) 2025-05-07T19:48:45.8461013Z Requirement already satisfied: hypothesis in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 17)) (6.131.14) 2025-05-07T19:48:45.8464394Z Requirement already satisfied: jinja2 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T19:48:45.8468731Z Requirement already satisfied: mpmath==1.3.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T19:48:45.9182223Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T19:48:45.9223479Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T19:48:46.1493299Z Collecting numpy>=2.0.2 (from -r requirements.txt (line 21)) 2025-05-07T19:48:46.1537710Z Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB) 2025-05-07T19:48:46.2909169Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T19:48:46.2949196Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T19:48:46.3259174Z Requirement already satisfied: pyyaml in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 23)) (6.0.2) 2025-05-07T19:48:46.3262721Z Requirement already satisfied: scikit-build in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 24)) (0.18.1) 2025-05-07T19:48:46.3266220Z Requirement already satisfied: setuptools in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T19:48:46.3936438Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T19:48:46.3975215Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T19:48:46.4290599Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T19:48:46.4332622Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T19:48:46.4742571Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T19:48:46.4785416Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T19:48:46.5314195Z Requirement already satisfied: packaging>=19.1 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from build->-r requirements.txt (line 14)) (25.0) 2025-05-07T19:48:46.5318674Z Requirement already satisfied: pyproject_hooks in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from build->-r requirements.txt (line 14)) (1.2.0) 2025-05-07T19:48:46.5326079Z Requirement already satisfied: importlib-metadata>=4.6 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from build->-r requirements.txt (line 14)) (8.7.0) 2025-05-07T19:48:46.5333742Z Requirement already satisfied: tomli>=1.1.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from build->-r requirements.txt (line 14)) (2.2.1) 2025-05-07T19:48:46.5462125Z Requirement already satisfied: attrs>=22.2.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from hypothesis->-r requirements.txt (line 17)) (25.3.0) 2025-05-07T19:48:46.5469358Z Requirement already satisfied: exceptiongroup>=1.0.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from hypothesis->-r requirements.txt (line 17)) (1.2.2) 2025-05-07T19:48:46.5475172Z Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from hypothesis->-r requirements.txt (line 17)) (2.4.0) 2025-05-07T19:48:46.5493493Z Requirement already satisfied: MarkupSafe>=2.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T19:48:46.6108591Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T19:48:46.6147247Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T19:48:46.6654905Z Requirement already satisfied: typing-extensions in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.12.2) 2025-05-07T19:48:46.6703510Z Requirement already satisfied: distro in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (1.9.0) 2025-05-07T19:48:46.6713577Z Requirement already satisfied: wheel>=0.32.0 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T19:48:46.6951616Z Requirement already satisfied: zipp>=3.20 in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from importlib-metadata>=4.6->build->-r requirements.txt (line 14)) (3.21.0) 2025-05-07T19:48:46.7605614Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T19:48:46.7645580Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T19:48:46.8132040Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T19:48:46.8714223Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T19:48:47.3802681Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 54.8 MB/s eta 0:00:00 2025-05-07T19:48:47.3849243Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T19:48:47.4058167Z Downloading numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB) 2025-05-07T19:48:47.7523378Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 19.5/19.5 MB 56.0 MB/s eta 0:00:00 2025-05-07T19:48:47.7568034Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T19:48:47.8201733Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T19:48:47.8851466Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T19:48:47.9300437Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T19:48:47.9745313Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T19:48:48.0255296Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T19:48:48.2594252Z Installing collected packages: tabulate, setuptools_git_versioning, patchelf, numpy, ninja, mypy-extensions, cmake, backports.tarfile, typing-inspect, pyre-extensions 2025-05-07T19:48:51.5261694Z 2025-05-07T19:48:51.5340845Z Successfully installed backports.tarfile-1.2.0 cmake-4.0.0 mypy-extensions-1.1.0 ninja-1.11.1.4 numpy-2.0.2 patchelf-0.17.2.2 pyre-extensions-0.0.32 setuptools_git_versioning-2.1.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T19:48:51.5344892Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2025-05-07T19:48:51.6938486Z ################################################################################ 2025-05-07T19:48:51.6938823Z # Install PyTorch (PyTorch PIP) 2025-05-07T19:48:51.6939091Z # 2025-05-07T19:48:51.6972716Z # [2025-05-07T19:48:51.696Z] + install_triton_pip /__w/_temp/conda_environment_14891846312 2025-05-07T19:48:51.6973178Z ################################################################################ 2025-05-07T19:48:51.6973400Z 2025-05-07T19:48:52.7303425Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:52.7303848Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:52.7304059Z 2025-05-07T19:48:52.7729551Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T19:48:53.9430286Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:53.9430710Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:53.9430916Z 2025-05-07T19:48:53.9876978Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T19:48:53.9878409Z [BUILD] Successfully ran git submodules update 2025-05-07T19:48:53.9928387Z [NOVA] Time taken to prepare the build : 9 seconds / 00:00:09 2025-05-07T19:48:53.9979564Z [BUILD] BUILD_TARGET_VARIANT: genai/cuda 2025-05-07T19:48:53.9980043Z [BUILD] Extracted build target: genai 2025-05-07T19:48:53.9980444Z [BUILD] Extracted build variant: cuda 2025-05-07T19:48:54.9486573Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:54.9487004Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:54.9487211Z 2025-05-07T19:48:54.9487325Z /opt/rh/gcc-toolset-11/root/usr/bin/cc 2025-05-07T19:48:54.9487535Z 2025-05-07T19:48:54.9922412Z [CHECK] Binary cc found in PATH 2025-05-07T19:48:55.9349997Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:55.9350473Z /opt/rh/gcc-toolset-11/root/usr/bin/gcc 2025-05-07T19:48:55.9350677Z 2025-05-07T19:48:55.9350795Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:55.9351006Z 2025-05-07T19:48:55.9771588Z [CHECK] Binary gcc found in PATH 2025-05-07T19:48:56.9209355Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:56.9209770Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:56.9210008Z 2025-05-07T19:48:56.9210113Z /opt/rh/gcc-toolset-11/root/usr/bin/c++ 2025-05-07T19:48:56.9210308Z 2025-05-07T19:48:56.9642510Z [CHECK] Binary c++ found in PATH 2025-05-07T19:48:57.9063265Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:57.9063810Z /opt/rh/gcc-toolset-11/root/usr/bin/g++ 2025-05-07T19:48:57.9064137Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:57.9064352Z 2025-05-07T19:48:57.9064356Z 2025-05-07T19:48:57.9486242Z [CHECK] Binary g++ found in PATH 2025-05-07T19:48:58.8487182Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:58.8488274Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:48:58.9315381Z [BUILD] Extracted and set Python tag: py39 2025-05-07T19:48:58.9315982Z [BUILD] Extracted and set Python platform name: manylinux_2_28_x86_64 2025-05-07T19:48:58.9382850Z core = 8 2025-05-07T19:48:58.9445930Z sockets = 1 2025-05-07T19:48:58.9447236Z [BUILD] Set multicore run option for setup.py: -j 8 2025-05-07T19:48:58.9448334Z [CHECK] LD_LIBRARY_PATH = /opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: 2025-05-07T19:48:58.9448838Z [BUILD] Running pre-build cleanups ... 2025-05-07T19:48:58.9449112Z + rm -rf dist 2025-05-07T19:48:58.9449232Z 2025-05-07T19:48:58.9477808Z 2025-05-07T19:48:58.9478274Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python setup.py clean 2025-05-07T19:48:58.9478717Z 2025-05-07T19:48:59.8502753Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:48:59.8503175Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:01.3787973Z INFO:root:running clean 2025-05-07T19:49:01.3789586Z [SETUP.PY] ARGV: ['setup.py', 'clean'] 2025-05-07T19:49:01.3790547Z [SETUP.PY] Parsed setup.py arguments: Namespace(verbose=False, debug=False, dryrun=False, build_target='default', build_variant='cuda', package_channel='nightly', nvml_lib_path=None, nccl_lib_path=None, use_fb_only=False, cxxprefix=None) 2025-05-07T19:49:01.3791531Z [SETUP.PY] Other arguments: ['clean'] 2025-05-07T19:49:01.3792133Z [SETUP.PY] CUDA CUB directory environment variable not set. Using default CUB location. 2025-05-07T19:49:01.3792609Z [SETUP.PY] Using CUDA = /usr/local/cuda-12.8 2025-05-07T19:49:01.3793157Z [SETUP.PY] Generating version file at: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/fbgemm_gpu/docs/version.py 2025-05-07T19:49:01.3793746Z [SETUP.PY] Setting the FBGEMM build target: default ... 2025-05-07T19:49:01.3794126Z [SETUP.PY] Setting the FBGEMM build variant: cuda ... 2025-05-07T19:49:01.3795225Z [SETUP.PY] Passing CMake arguments: ['-DCMAKE_PREFIX_PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch', '-D_GLIBCXX_USE_CXX11_ABI=1', '-DFBGEMM_BUILD_TARGET=default', '-DFBGEMM_BUILD_VARIANT=cuda', "-DCMAKE_C_FLAGS=''", "-DCMAKE_CXX_FLAGS=''"] 2025-05-07T19:49:01.7722587Z 2025-05-07T19:49:01.7722933Z [BUILD] Printing git status ... 2025-05-07T19:49:01.7723228Z + git status 2025-05-07T19:49:01.7723369Z 2025-05-07T19:49:02.5778623Z HEAD detached at pull/4066/merge 2025-05-07T19:49:02.5779007Z Untracked files: 2025-05-07T19:49:02.5779287Z (use "git add ..." to include in what will be committed) 2025-05-07T19:49:02.5779617Z ../collect_env.py 2025-05-07T19:49:02.5779836Z fbgemm_gpu/docs/version.py 2025-05-07T19:49:02.5779999Z 2025-05-07T19:49:02.5780223Z nothing added to commit but untracked files present (use "git add" to track) 2025-05-07T19:49:02.5782374Z 2025-05-07T19:49:02.5782686Z + git diff 2025-05-07T19:49:02.5782848Z 2025-05-07T19:49:02.6196123Z 2025-05-07T19:49:02.6197012Z ################################################################################ 2025-05-07T19:49:02.6197631Z # Configure FBGEMM-GPU Build 2025-05-07T19:49:02.6197912Z # 2025-05-07T19:49:02.6227910Z # [2025-05-07T19:49:02.622Z] + __configure_fbgemm_gpu_build 2025-05-07T19:49:02.6228275Z ################################################################################ 2025-05-07T19:49:02.6228484Z 2025-05-07T19:49:02.6237370Z [BUILD] Setting the build target: genai ... 2025-05-07T19:49:02.6237800Z [BUILD] Configuring build as CUDA variant (this is the default behavior) ... 2025-05-07T19:49:03.5662598Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:03.5663389Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:03.5663781Z 2025-05-07T19:49:03.5663969Z /usr/local/cuda-12.8/bin/nvcc 2025-05-07T19:49:03.5664313Z 2025-05-07T19:49:03.6100369Z [CHECK] Binary nvcc found in PATH 2025-05-07T19:49:04.5507605Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:04.5508192Z /usr/local/cuda-12.8/include 2025-05-07T19:49:04.5508437Z 2025-05-07T19:49:04.5508606Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:04.5508922Z 2025-05-07T19:49:04.5936974Z [CHECK] Environment variable CUDNN_INCLUDE_DIR is defined in the Conda environment 2025-05-07T19:49:05.5363182Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:05.5363573Z /usr/local/cuda-12.8/lib64 2025-05-07T19:49:05.5364085Z 2025-05-07T19:49:05.5364212Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:05.5364422Z 2025-05-07T19:49:05.5792789Z [CHECK] Environment variable CUDNN_LIBRARY is defined in the Conda environment 2025-05-07T19:49:06.5243086Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:06.5243590Z /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:06.5243992Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:06.5244547Z 2025-05-07T19:49:06.5244555Z 2025-05-07T19:49:06.5673584Z [CHECK] Environment variable NVML_LIB_PATH is defined in the Conda environment 2025-05-07T19:49:06.5678576Z [BUILD] Using the environment-supplied TORCH_CUDA_ARCH_LIST as the CUDA targets ... 2025-05-07T19:49:06.5679348Z [BUILD] Setting the following CUDA targets: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T19:49:06.5679881Z [BUILD] Looking up NVML filepath ... 2025-05-07T19:49:07.4654490Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:07.4654930Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:07.5487308Z [BUILD] Looking up NCCL filepath ... 2025-05-07T19:49:08.4874795Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:08.4875203Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:08.4875422Z 2025-05-07T19:49:09.5398302Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:09.5398938Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:09.5399277Z 2025-05-07T19:49:09.5826701Z [BUILD] Setting NVCC verbose mode ... 2025-05-07T19:49:09.5827165Z + conda env config vars set -p /__w/_temp/conda_environment_14891846312 NVCC_VERBOSE=1 2025-05-07T19:49:09.5827502Z 2025-05-07T19:49:10.0589013Z To make your changes take effect please reactivate your environment 2025-05-07T19:49:10.1008892Z 2025-05-07T19:49:10.1009162Z [BUILD] Setting CUDA build args ... 2025-05-07T19:49:10.1021081Z [BUILD] Looking up CUDA version ... 2025-05-07T19:49:11.0430949Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:11.0431373Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:11.0431581Z 2025-05-07T19:49:12.0273301Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:12.0273921Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:12.0274222Z 2025-05-07T19:49:12.0709282Z + conda run -p /__w/_temp/conda_environment_14891846312 c++ --version | grep -i clang 2025-05-07T19:49:12.0709648Z 2025-05-07T19:49:13.0073461Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:13.0073890Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:13.0074104Z 2025-05-07T19:49:13.0505474Z 2025-05-07T19:49:13.0506055Z [BUILD] Setting NVCC flags ... 2025-05-07T19:49:13.0507596Z + conda env config vars set -p /__w/_temp/conda_environment_14891846312 NVCC_PREPEND_FLAGS="-std=c++20 -Xcompiler -std=c++20 -ccbin /opt/rh/gcc-toolset-11/root/usr/bin/c++ -allow-unsupported-compiler" 2025-05-07T19:49:13.0508364Z 2025-05-07T19:49:13.5261263Z To make your changes take effect please reactivate your environment 2025-05-07T19:49:13.5682417Z 2025-05-07T19:49:13.5682885Z + conda run -p /__w/_temp/conda_environment_14891846312 printenv NVCC_PREPEND_FLAGS 2025-05-07T19:49:13.5683235Z 2025-05-07T19:49:14.5076034Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:14.5076658Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:14.5076955Z 2025-05-07T19:49:14.5077485Z -std=c++20 -Xcompiler -std=c++20 -ccbin /opt/rh/gcc-toolset-11/root/usr/bin/c++ -allow-unsupported-compiler 2025-05-07T19:49:14.5078131Z 2025-05-07T19:49:14.5504518Z 2025-05-07T19:49:14.5505157Z [BUILD] Setting CUDA build args ... 2025-05-07T19:49:14.5505606Z + conda run -p /__w/_temp/conda_environment_14891846312 c++ --version 2025-05-07T19:49:14.5505889Z 2025-05-07T19:49:15.4892748Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:15.4893304Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:15.4893906Z 2025-05-07T19:49:15.4894011Z c++ (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9) 2025-05-07T19:49:15.4894339Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T19:49:15.4894772Z This is free software; see the source for copying conditions. There is NO 2025-05-07T19:49:15.4895298Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T19:49:15.4895617Z 2025-05-07T19:49:15.4895622Z 2025-05-07T19:49:15.5323182Z 2025-05-07T19:49:15.5324583Z + conda run -p /__w/_temp/conda_environment_14891846312 c++ --version | grep -i clang 2025-05-07T19:49:15.5325013Z 2025-05-07T19:49:16.4747459Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:16.4747875Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:16.4748072Z 2025-05-07T19:49:16.5174229Z 2025-05-07T19:49:16.5174967Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/.github/scripts/fbgemm_gpu_build.bash: line 370: [: : integer expression expected 2025-05-07T19:49:16.5178070Z [BUILD] Enabling debug features in the build ... 2025-05-07T19:49:16.5179669Z [BUILD] FBGEMM_GPU build arguments have been set: --verbose --build-target=genai --build-variant=cuda --nvml_lib_path=/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so --nccl_lib_path=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 -DTORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' -DCMAKE_CXX_STANDARD=20 --debug 2025-05-07T19:49:16.5181220Z ################################################################################ 2025-05-07T19:49:16.5181532Z # Build FBGEMM-GPU Package (Wheel) 2025-05-07T19:49:16.5181798Z # 2025-05-07T19:49:16.5213800Z # [2025-05-07T19:49:16.520Z] + build_fbgemm_gpu_package /__w/_temp/conda_environment_14891846312 nightly genai/cuda 2025-05-07T19:49:16.5214330Z ################################################################################ 2025-05-07T19:49:16.5214541Z 2025-05-07T19:49:16.5214723Z [BUILD] Building FBGEMM wheel (TARGET=genai, VARIANT=cuda) ... 2025-05-07T19:49:16.5218106Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m build --wheel --no-isolation --config-setting=--build-option=--verbose --config-setting=--build-option=--build-target=genai --config-setting=--build-option=--build-variant=cuda --config-setting=--build-option=--nvml_lib_path=/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so --config-setting=--build-option=--nccl_lib_path=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 --config-setting=--build-option=-DTORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' --config-setting=--build-option=-DCMAKE_CXX_STANDARD=20 --config-setting=--build-option=--debug --config-setting=--build-option=--package_channel=nightly --config-setting=--build-option=--python-tag=py39 --config-setting=--build-option=--plat-name=manylinux_2_28_x86_64 2025-05-07T19:49:16.5221171Z 2025-05-07T19:49:17.4278343Z WARNING: overwriting environment variables set in the machine 2025-05-07T19:49:17.4278789Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T19:49:17.5145825Z * Getting build dependencies for wheel... 2025-05-07T19:49:19.0070623Z INFO:root:running egg_info 2025-05-07T19:49:19.0104413Z INFO:root:creating fbgemm_gpu.egg-info 2025-05-07T19:49:19.0106075Z INFO:root:writing fbgemm_gpu.egg-info/PKG-INFO 2025-05-07T19:49:19.0110465Z INFO:root:writing dependency_links to fbgemm_gpu.egg-info/dependency_links.txt 2025-05-07T19:49:19.0113769Z INFO:root:writing requirements to fbgemm_gpu.egg-info/requires.txt 2025-05-07T19:49:19.0114603Z INFO:root:writing top-level names to fbgemm_gpu.egg-info/top_level.txt 2025-05-07T19:49:19.0116457Z INFO:root:writing manifest file 'fbgemm_gpu.egg-info/SOURCES.txt' 2025-05-07T19:49:19.0187142Z INFO:root:reading manifest file 'fbgemm_gpu.egg-info/SOURCES.txt' 2025-05-07T19:49:19.0203194Z INFO:root:writing manifest file 'fbgemm_gpu.egg-info/SOURCES.txt' 2025-05-07T19:49:19.0208396Z [SETUP.PY] ARGV: ['setup.py', 'egg_info'] 2025-05-07T19:49:19.0209696Z [SETUP.PY] Parsed setup.py arguments: Namespace(verbose=False, debug=False, dryrun=False, build_target='default', build_variant='cuda', package_channel='nightly', nvml_lib_path=None, nccl_lib_path=None, use_fb_only=False, cxxprefix=None) 2025-05-07T19:49:19.0210656Z [SETUP.PY] Other arguments: ['egg_info'] 2025-05-07T19:49:19.0211111Z [SETUP.PY] CUDA CUB directory environment variable not set. Using default CUB location. 2025-05-07T19:49:19.0211731Z [SETUP.PY] Using CUDA = /usr/local/cuda-12.8 2025-05-07T19:49:19.0212284Z [SETUP.PY] Generating version file at: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/fbgemm_gpu/docs/version.py 2025-05-07T19:49:19.0212869Z [SETUP.PY] Setting the FBGEMM build target: default ... 2025-05-07T19:49:19.0213248Z [SETUP.PY] Setting the FBGEMM build variant: cuda ... 2025-05-07T19:49:19.0214335Z [SETUP.PY] Passing CMake arguments: ['-DCMAKE_PREFIX_PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch', '-D_GLIBCXX_USE_CXX11_ABI=1', '-DFBGEMM_BUILD_TARGET=default', '-DFBGEMM_BUILD_VARIANT=cuda', "-DCMAKE_C_FLAGS=''", "-DCMAKE_CXX_FLAGS=''"] 2025-05-07T19:49:19.3945774Z * Building wheel... 2025-05-07T19:49:20.8922347Z [SETUP.PY] ARGV: ['setup.py', 'bdist_wheel', '--dist-dir', '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/dist/.tmp-4fpfzjvl', '--verbose', '--build-target=genai', '--build-variant=cuda', '--nvml_lib_path=/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so', '--nccl_lib_path=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2', '-DTORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a', '-DCMAKE_CXX_STANDARD=20', '--debug', '--package_channel=nightly', '--python-tag=py39', '--plat-name=manylinux_2_28_x86_64'] 2025-05-07T19:49:20.8925521Z [SETUP.PY] Parsed setup.py arguments: Namespace(verbose=True, debug=True, dryrun=False, build_target='genai', build_variant='cuda', package_channel='nightly', nvml_lib_path='/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so', nccl_lib_path='/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2', use_fb_only=False, cxxprefix=None) 2025-05-07T19:49:20.8927655Z [SETUP.PY] Other arguments: ['bdist_wheel', '--dist-dir', '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/dist/.tmp-4fpfzjvl', '-DTORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a', '-DCMAKE_CXX_STANDARD=20', '--python-tag=py39', '--plat-name=manylinux_2_28_x86_64'] 2025-05-07T19:49:20.8928772Z [SETUP.PY] CUDA CUB directory environment variable not set. Using default CUB location. 2025-05-07T19:49:20.8929254Z [SETUP.PY] Using CUDA = /usr/local/cuda-12.8 2025-05-07T19:49:20.8929809Z [SETUP.PY] Generating version file at: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/fbgemm_gpu/docs/version.py 2025-05-07T19:49:20.8930400Z [SETUP.PY] Setting the FBGEMM build target: genai ... 2025-05-07T19:49:20.8930855Z [SETUP.PY] Setting the FBGEMM build variant: cuda ... 2025-05-07T19:49:20.8934431Z [SETUP.PY] Passing CMake arguments: ['-DCMAKE_PREFIX_PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch', '-D_GLIBCXX_USE_CXX11_ABI=1', '-DCMAKE_VERBOSE_MAKEFILE=ON', '-DCMAKE_EXPORT_COMPILE_COMMANDS=TRUE', '-DFBGEMM_BUILD_TARGET=genai', '-DFBGEMM_BUILD_VARIANT=cuda', '-DNVML_LIB_PATH=/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so', '-DNCCL_INCLUDE_DIRS=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include', '-DNCCL_LIBRARIES=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2', "-DCMAKE_C_FLAGS='-DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib'", "-DCMAKE_CXX_FLAGS='-DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib'", '-DTORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a', '-DCMAKE_CXX_STANDARD=20'] 2025-05-07T19:49:20.8937852Z 2025-05-07T19:49:20.8938141Z 2025-05-07T19:49:20.8938303Z -------------------------------------------------------------------------------- 2025-05-07T19:49:20.8938658Z -- Trying 'Ninja' generator 2025-05-07T19:49:20.8938904Z -------------------------------- 2025-05-07T19:49:20.8939141Z --------------------------- 2025-05-07T19:49:20.8939363Z ---------------------- 2025-05-07T19:49:20.8939563Z ----------------- 2025-05-07T19:49:20.8939757Z ------------ 2025-05-07T19:49:20.8939934Z ------- 2025-05-07T19:49:20.8940106Z -- 2025-05-07T19:49:20.9116142Z CMake Deprecation Warning at CMakeLists.txt:1 (cmake_minimum_required): 2025-05-07T19:49:20.9117148Z Not searching for unused variables given on the command line. 2025-05-07T19:49:20.9118128Z Compatibility with CMake < 3.10 will be removed from a future version of 2025-05-07T19:49:20.9118869Z CMake. 2025-05-07T19:49:20.9119077Z 2025-05-07T19:49:20.9119492Z Update the VERSION argument value. Or, use the ... syntax 2025-05-07T19:49:20.9120501Z to tell CMake that the project requires at least but has been updated 2025-05-07T19:49:20.9121376Z to work with policies introduced by or earlier. 2025-05-07T19:49:20.9121840Z 2025-05-07T19:49:20.9121849Z 2025-05-07T19:49:20.9714200Z -- The C compiler identification is GNU 11.2.1 2025-05-07T19:49:20.9848620Z -- Detecting C compiler ABI info 2025-05-07T19:49:21.0496973Z -- Detecting C compiler ABI info - done 2025-05-07T19:49:21.0687198Z -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped 2025-05-07T19:49:21.0691307Z -- Detecting C compile features 2025-05-07T19:49:21.0695781Z -- Detecting C compile features - done 2025-05-07T19:49:21.1742424Z -- The CXX compiler identification is GNU 11.2.1 2025-05-07T19:49:21.1855763Z -- Detecting CXX compiler ABI info 2025-05-07T19:49:21.2797662Z -- Detecting CXX compiler ABI info - done 2025-05-07T19:49:21.2992120Z -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped 2025-05-07T19:49:21.2996033Z -- Detecting CXX compile features 2025-05-07T19:49:21.3005616Z -- Detecting CXX compile features - done 2025-05-07T19:49:21.3110273Z -- Configuring done (0.4s) 2025-05-07T19:49:21.3189391Z -- Generating done (0.0s) 2025-05-07T19:49:21.3203473Z -- Build files have been written to: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_cmake_test_compile/build 2025-05-07T19:49:21.3249142Z -- 2025-05-07T19:49:21.3249342Z ------- 2025-05-07T19:49:21.3249594Z ------------ 2025-05-07T19:49:21.3249818Z ----------------- 2025-05-07T19:49:21.3250035Z ---------------------- 2025-05-07T19:49:21.3250247Z --------------------------- 2025-05-07T19:49:21.3250477Z -------------------------------- 2025-05-07T19:49:21.3250742Z -- Trying 'Ninja' generator - success 2025-05-07T19:49:21.3251086Z -------------------------------------------------------------------------------- 2025-05-07T19:49:21.3251341Z 2025-05-07T19:49:21.3264980Z Configuring Project 2025-05-07T19:49:21.3265246Z Working directory: 2025-05-07T19:49:21.3265643Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build 2025-05-07T19:49:21.3266079Z Command: 2025-05-07T19:49:21.3275808Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/cmake/data/bin/cmake /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -G Ninja -DCMAKE_MAKE_PROGRAM:FILEPATH=/__w/_temp/conda_environment_14891846312/bin/ninja --no-warn-unused-cli -DCMAKE_INSTALL_PREFIX:PATH=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install -DPYTHON_VERSION_STRING:STRING=3.9.22 -DSKBUILD:INTERNAL=TRUE -DCMAKE_MODULE_PATH:PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/skbuild/resources/cmake -DPYTHON_EXECUTABLE:PATH=/__w/_temp/conda_environment_14891846312/bin/python -DPYTHON_INCLUDE_DIR:PATH=/__w/_temp/conda_environment_14891846312/include/python3.9 -DPYTHON_LIBRARY:PATH=/__w/_temp/conda_environment_14891846312/lib/libpython3.9.so -DPython_EXECUTABLE:PATH=/__w/_temp/conda_environment_14891846312/bin/python -DPython_ROOT_DIR:PATH=/__w/_temp/conda_environment_14891846312 -DPython_FIND_REGISTRY:STRING=NEVER -DPython_INCLUDE_DIR:PATH=/__w/_temp/conda_environment_14891846312/include/python3.9 -DPython_NumPy_INCLUDE_DIRS:PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/numpy/_core/include -DPython3_EXECUTABLE:PATH=/__w/_temp/conda_environment_14891846312/bin/python -DPython3_ROOT_DIR:PATH=/__w/_temp/conda_environment_14891846312 -DPython3_FIND_REGISTRY:STRING=NEVER -DPython3_INCLUDE_DIR:PATH=/__w/_temp/conda_environment_14891846312/include/python3.9 -DPython3_NumPy_INCLUDE_DIRS:PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/numpy/_core/include -DCMAKE_MAKE_PROGRAM:FILEPATH=/__w/_temp/conda_environment_14891846312/bin/ninja -DCMAKE_PREFIX_PATH=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch -D_GLIBCXX_USE_CXX11_ABI=1 -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_EXPORT_COMPILE_COMMANDS=TRUE -DFBGEMM_BUILD_TARGET=genai -DFBGEMM_BUILD_VARIANT=cuda -DNVML_LIB_PATH=/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so -DNCCL_INCLUDE_DIRS=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -DNCCL_LIBRARIES=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 '-DCMAKE_C_FLAGS='"'"'-DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib'"'"'' '-DCMAKE_CXX_FLAGS='"'"'-DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib'"'"'' '-DTORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' -DCMAKE_CXX_STANDARD=20 '-DTORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' -DCMAKE_CXX_STANDARD=20 -DCMAKE_BUILD_TYPE:STRING=Release 2025-05-07T19:49:21.3285933Z 2025-05-07T19:49:21.3467587Z 2025-05-07T19:49:21.3467801Z Not searching for unused variables given on the command line. 2025-05-07T19:49:21.3468163Z 2025-05-07T19:49:21.3468318Z ================================================================================ 2025-05-07T19:49:21.3468699Z Default C compiler flags 2025-05-07T19:49:21.3469028Z (values may be overridden by CMAKE_CXX_STANDARD and CXX_STANDARD): 2025-05-07T19:49:21.3469314Z 2025-05-07T19:49:21.3469693Z -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib 2025-05-07T19:49:21.3470283Z ================================================================================ 2025-05-07T19:49:21.3470501Z 2025-05-07T19:49:21.3470506Z 2025-05-07T19:49:21.3470510Z 2025-05-07T19:49:21.3470611Z ================================================================================ 2025-05-07T19:49:21.3470916Z Default C++ compiler flags 2025-05-07T19:49:21.3471248Z (values may be overridden by CMAKE_CXX_STANDARD and CXX_STANDARD): 2025-05-07T19:49:21.3471529Z 2025-05-07T19:49:21.3471898Z -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib 2025-05-07T19:49:21.3472606Z ================================================================================ 2025-05-07T19:49:21.3472813Z 2025-05-07T19:49:21.3472817Z 2025-05-07T19:49:21.3472821Z 2025-05-07T19:49:21.3472921Z ================================================================================ 2025-05-07T19:49:21.3473203Z AVX2_FLAGS: 2025-05-07T19:49:21.3473313Z 2025-05-07T19:49:21.3473388Z -mavx2 2025-05-07T19:49:21.3473552Z -mf16c 2025-05-07T19:49:21.3473723Z -mfma 2025-05-07T19:49:21.3473893Z -fopenmp 2025-05-07T19:49:21.3474101Z ================================================================================ 2025-05-07T19:49:21.3474347Z 2025-05-07T19:49:21.3474351Z 2025-05-07T19:49:21.3474355Z 2025-05-07T19:49:21.3474463Z ================================================================================ 2025-05-07T19:49:21.3474737Z AVX512_FLAGS: 2025-05-07T19:49:21.3474874Z 2025-05-07T19:49:21.3474947Z -mavx2 2025-05-07T19:49:21.3475108Z -mf16c 2025-05-07T19:49:21.3475470Z -mfma 2025-05-07T19:49:21.3475635Z -mavx512f 2025-05-07T19:49:21.3475813Z -mavx512bw 2025-05-07T19:49:21.3483198Z -mavx512dq 2025-05-07T19:49:21.3483424Z -mavx512vl 2025-05-07T19:49:21.3483636Z -fopenmp 2025-05-07T19:49:21.3483873Z ================================================================================ 2025-05-07T19:49:21.3484093Z 2025-05-07T19:49:21.3484097Z 2025-05-07T19:49:21.3484101Z 2025-05-07T19:49:21.3484215Z ================================================================================ 2025-05-07T19:49:21.3484699Z The project is built using scikit-build 2025-05-07T19:49:21.3485260Z ================================================================================ 2025-05-07T19:49:21.3485470Z 2025-05-07T19:49:21.3485474Z 2025-05-07T19:49:21.3485478Z 2025-05-07T19:49:21.3485585Z ================================================================================ 2025-05-07T19:49:21.3485864Z Build Settings 2025-05-07T19:49:21.3485985Z 2025-05-07T19:49:21.3486092Z FBGEMM_BUILD_TARGET : genai 2025-05-07T19:49:21.3486359Z FBGEMM_BUILD_VARIANT : cuda 2025-05-07T19:49:21.3486533Z 2025-05-07T19:49:21.3486618Z NVCC_VERBOSE : 2025-05-07T19:49:21.3486848Z CUDNN_INCLUDE_DIR : 2025-05-07T19:49:21.3487084Z CUDNN_LIBRARY : 2025-05-07T19:49:21.3487479Z NVML_LIB_PATH : /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:21.3487906Z TORCH_CUDA_ARCH_LIST : 7.0 2025-05-07T19:49:21.3488140Z 8.0 2025-05-07T19:49:21.3488301Z 9.0 2025-05-07T19:49:21.3488474Z 9.0a 2025-05-07T19:49:21.3488636Z 10.0a 2025-05-07T19:49:21.3488805Z 12.0a 2025-05-07T19:49:21.3488904Z 2025-05-07T19:49:21.3488987Z HIP_ROOT_DIR : 2025-05-07T19:49:21.3489223Z HIPCC_VERBOSE : 2025-05-07T19:49:21.3489450Z AMDGPU_TARGETS : 2025-05-07T19:49:21.3489687Z PYTORCH_ROCM_ARCH : 2025-05-07T19:49:21.3489935Z ================================================================================ 2025-05-07T19:49:21.3490156Z 2025-05-07T19:49:21.4473887Z -- The CXX compiler identification is GNU 11.2.1 2025-05-07T19:49:21.4994636Z -- The C compiler identification is GNU 11.2.1 2025-05-07T19:49:22.6417635Z -- The CUDA compiler identification is NVIDIA 12.8.61 with host compiler GNU 11.2.1 2025-05-07T19:49:22.6533360Z -- Detecting CXX compiler ABI info 2025-05-07T19:49:22.7502338Z -- Detecting CXX compiler ABI info - done 2025-05-07T19:49:22.7699320Z -- Check for working CXX compiler: /opt/rh/gcc-toolset-11/root/usr/bin/c++ - skipped 2025-05-07T19:49:22.7702902Z -- Detecting CXX compile features 2025-05-07T19:49:22.7712673Z -- Detecting CXX compile features - done 2025-05-07T19:49:22.7896211Z -- Detecting C compiler ABI info 2025-05-07T19:49:22.8552436Z -- Detecting C compiler ABI info - done 2025-05-07T19:49:22.8742982Z -- Check for working C compiler: /opt/rh/gcc-toolset-11/root/usr/bin/cc - skipped 2025-05-07T19:49:22.8746845Z -- Detecting C compile features 2025-05-07T19:49:22.8752103Z -- Detecting C compile features - done 2025-05-07T19:49:22.8902594Z -- Detecting CUDA compiler ABI info 2025-05-07T19:49:23.9580115Z -- Detecting CUDA compiler ABI info - done 2025-05-07T19:49:24.0250127Z -- Check for working CUDA compiler: /usr/local/cuda-12.8/bin/nvcc - skipped 2025-05-07T19:49:24.0292049Z -- Detecting CUDA compile features 2025-05-07T19:49:24.0296908Z -- Detecting CUDA compile features - done 2025-05-07T19:49:24.0414523Z -- Performing Test C_HAS_AVX_1 2025-05-07T19:49:24.2566156Z -- Performing Test C_HAS_AVX_1 - Failed 2025-05-07T19:49:24.2566553Z -- Performing Test C_HAS_AVX_2 2025-05-07T19:49:24.4955357Z -- Performing Test C_HAS_AVX_2 - Success 2025-05-07T19:49:24.4958194Z -- Performing Test C_HAS_AVX2_1 2025-05-07T19:49:24.7114151Z -- Performing Test C_HAS_AVX2_1 - Failed 2025-05-07T19:49:24.7114613Z -- Performing Test C_HAS_AVX2_2 2025-05-07T19:49:24.9457841Z -- Performing Test C_HAS_AVX2_2 - Success 2025-05-07T19:49:24.9459989Z -- Performing Test C_HAS_AVX512_1 2025-05-07T19:49:25.1606389Z -- Performing Test C_HAS_AVX512_1 - Failed 2025-05-07T19:49:25.1607068Z -- Performing Test C_HAS_AVX512_2 2025-05-07T19:49:25.5236565Z -- Performing Test C_HAS_AVX512_2 - Failed 2025-05-07T19:49:25.5237165Z -- Performing Test C_HAS_AVX512_3 2025-05-07T19:49:25.7366281Z -- Performing Test C_HAS_AVX512_3 - Failed 2025-05-07T19:49:25.7367292Z -- Performing Test CXX_HAS_AVX_1 2025-05-07T19:49:25.9505447Z -- Performing Test CXX_HAS_AVX_1 - Failed 2025-05-07T19:49:25.9506290Z -- Performing Test CXX_HAS_AVX_2 2025-05-07T19:49:26.1883326Z -- Performing Test CXX_HAS_AVX_2 - Success 2025-05-07T19:49:26.1885596Z -- Performing Test CXX_HAS_AVX2_1 2025-05-07T19:49:26.4022779Z -- Performing Test CXX_HAS_AVX2_1 - Failed 2025-05-07T19:49:26.4023209Z -- Performing Test CXX_HAS_AVX2_2 2025-05-07T19:49:26.6382430Z -- Performing Test CXX_HAS_AVX2_2 - Success 2025-05-07T19:49:26.6384578Z -- Performing Test CXX_HAS_AVX512_1 2025-05-07T19:49:26.8534211Z -- Performing Test CXX_HAS_AVX512_1 - Failed 2025-05-07T19:49:26.8534689Z -- Performing Test CXX_HAS_AVX512_2 2025-05-07T19:49:27.2598730Z -- Performing Test CXX_HAS_AVX512_2 - Failed 2025-05-07T19:49:27.2599900Z -- Performing Test CXX_HAS_AVX512_3 2025-05-07T19:49:27.4738851Z -- Performing Test CXX_HAS_AVX512_3 - Failed 2025-05-07T19:49:27.5007399Z -- Found CUDA: /usr/local/cuda-12.8 (found version "12.8") 2025-05-07T19:49:27.5054168Z -- Found CUDAToolkit: /usr/local/cuda-12.8/include (found version "12.8.61") 2025-05-07T19:49:27.5136481Z -- Performing Test CMAKE_HAVE_LIBC_PTHREAD 2025-05-07T19:49:27.5851015Z -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed 2025-05-07T19:49:27.5851419Z -- Looking for pthread_create in pthreads 2025-05-07T19:49:27.6412061Z -- Looking for pthread_create in pthreads - not found 2025-05-07T19:49:27.6412466Z -- Looking for pthread_create in pthread 2025-05-07T19:49:27.7077161Z -- Looking for pthread_create in pthread - found 2025-05-07T19:49:27.7088854Z -- Found Threads: TRUE 2025-05-07T19:49:27.8383670Z -- PyTorch: CUDA detected: 12.8 2025-05-07T19:49:27.8384006Z -- PyTorch: CUDA nvcc is: /usr/local/cuda-12.8/bin/nvcc 2025-05-07T19:49:27.8384413Z -- PyTorch: CUDA toolkit directory: /usr/local/cuda-12.8 2025-05-07T19:49:27.9749955Z -- PyTorch: Header version is: 12.8 2025-05-07T19:49:28.1702456Z -- Found Python: /__w/_temp/conda_environment_14891846312/bin/python (found version "3.9.22") found components: Interpreter 2025-05-07T19:49:28.1725944Z CMake Warning at /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:140 (message): 2025-05-07T19:49:28.1726695Z -- USE_CUDNN is set to 0. Compiling without cuDNN support 2025-05-07T19:49:28.1727119Z -- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support 2025-05-07T19:49:28.1727548Z -- USE_CUDSS is set to 0. Compiling without cuDSS support 2025-05-07T19:49:28.1727928Z -- USE_CUFILE is set to 0. Compiling without cuFile support 2025-05-07T19:49:28.1728308Z Failed to compute shorthash for libnvrtc.so 2025-05-07T19:49:28.1728618Z Call Stack (most recent call first): 2025-05-07T19:49:28.1729226Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include) 2025-05-07T19:49:28.1730257Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package) 2025-05-07T19:49:28.1731143Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/cmake/modules/PyTorchSetup.cmake:14 (find_package) 2025-05-07T19:49:28.1731620Z CMakeLists.txt:112 (include) 2025-05-07T19:49:28.1731789Z 2025-05-07T19:49:28.1731793Z 2025-05-07T19:49:28.1732573Z -- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_90a,code=sm_90a;-gencode;arch=compute_100a,code=sm_100a;-gencode;arch=compute_120a,code=sm_120a 2025-05-07T19:49:28.2614429Z CMake Warning at /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message): 2025-05-07T19:49:28.2615146Z static library kineto_LIBRARY-NOTFOUND not found. 2025-05-07T19:49:28.2615745Z Call Stack (most recent call first): 2025-05-07T19:49:28.2616464Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:125 (append_torchlib_if_found) 2025-05-07T19:49:28.2617319Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/cmake/modules/PyTorchSetup.cmake:14 (find_package) 2025-05-07T19:49:28.2617787Z CMakeLists.txt:112 (include) 2025-05-07T19:49:28.2617964Z 2025-05-07T19:49:28.2617968Z 2025-05-07T19:49:28.2625190Z -- Found Torch: /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so 2025-05-07T19:49:28.2625756Z 2025-05-07T19:49:28.2625855Z 2025-05-07T19:49:28.2626027Z ================================================================================ 2025-05-07T19:49:28.2626355Z PyTorch Flags: 2025-05-07T19:49:28.2626548Z 2025-05-07T19:49:28.2626750Z TORCH_INCLUDE_DIRS: 2025-05-07T19:49:28.2627126Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include 2025-05-07T19:49:28.2627806Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include 2025-05-07T19:49:28.2628292Z 2025-05-07T19:49:28.2628472Z TORCH_LIBRARIES: 2025-05-07T19:49:28.2628669Z torch 2025-05-07T19:49:28.2628844Z torch_library 2025-05-07T19:49:28.2629214Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so 2025-05-07T19:49:28.2629663Z /usr/local/cuda-12.8/lib64/libnvrtc.so 2025-05-07T19:49:28.2630138Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so 2025-05-07T19:49:28.2630570Z 2025-05-07T19:49:28.2630749Z TORCH_CUDA_OPTIONS: 2025-05-07T19:49:28.2630971Z --expt-relaxed-constexpr 2025-05-07T19:49:28.2631226Z -D__CUDA_NO_HALF_OPERATORS__ 2025-05-07T19:49:28.2631484Z -D__CUDA_NO_BFLOAT16_CONVERSIONS__ 2025-05-07T19:49:28.2631782Z -D__CUDA_NO_HALF2_OPERATORS__ 2025-05-07T19:49:28.2632150Z ================================================================================ 2025-05-07T19:49:28.2632371Z 2025-05-07T19:49:28.2632376Z 2025-05-07T19:49:28.2632380Z 2025-05-07T19:49:28.2632483Z ================================================================================ 2025-05-07T19:49:28.2632768Z NCCL Flags 2025-05-07T19:49:28.2632880Z 2025-05-07T19:49:28.2633203Z NCCL_INCLUDE_DIRS=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include 2025-05-07T19:49:28.2633962Z NCCL_LIBRARIES=/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 2025-05-07T19:49:28.2634504Z ================================================================================ 2025-05-07T19:49:28.2634711Z 2025-05-07T19:49:28.2634715Z 2025-05-07T19:49:28.2634719Z 2025-05-07T19:49:28.2634818Z ================================================================================ 2025-05-07T19:49:28.2635108Z CUDA Driver Path 2025-05-07T19:49:28.2635236Z 2025-05-07T19:49:28.2635351Z CUDA_DRIVER_LIBRARIES=/usr/lib64/libcuda.so 2025-05-07T19:49:28.2635658Z ================================================================================ 2025-05-07T19:49:28.2635886Z 2025-05-07T19:49:28.2636181Z -- Found NVML_LIB_PATH: /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:28.2666363Z 2025-05-07T19:49:28.2666618Z 2025-05-07T19:49:28.2666783Z ================================================================================ 2025-05-07T19:49:28.2667185Z GPU CPP Library Target: asmjit (SHARED) 2025-05-07T19:49:28.2667495Z 2025-05-07T19:49:28.2667675Z CPU_SRCS: 2025-05-07T19:49:28.2667786Z 2025-05-07T19:49:28.2667854Z 2025-05-07T19:49:28.2668115Z GPU_SRCS: 2025-05-07T19:49:28.2668229Z 2025-05-07T19:49:28.2668297Z 2025-05-07T19:49:28.2668536Z CUDA_SPECIFIC_SRCS: 2025-05-07T19:49:28.2668692Z 2025-05-07T19:49:28.2668772Z 2025-05-07T19:49:28.2668944Z HIP_SPECIFIC_SRCS: 2025-05-07T19:49:28.2669076Z 2025-05-07T19:49:28.2669149Z 2025-05-07T19:49:28.2669371Z OTHER_SRCS: 2025-05-07T19:49:28.2670098Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64assembler.cpp 2025-05-07T19:49:28.2670890Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64builder.cpp 2025-05-07T19:49:28.2671657Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64compiler.cpp 2025-05-07T19:49:28.2672780Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64emithelper.cpp 2025-05-07T19:49:28.2673881Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64formatter.cpp 2025-05-07T19:49:28.2674627Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64func.cpp 2025-05-07T19:49:28.2675298Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64instapi.cpp 2025-05-07T19:49:28.2675973Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64instdb.cpp 2025-05-07T19:49:28.2676640Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64operand.cpp 2025-05-07T19:49:28.2677317Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/a64rapass.cpp 2025-05-07T19:49:28.2677996Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/arm/armformatter.cpp 2025-05-07T19:49:28.2678683Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/archtraits.cpp 2025-05-07T19:49:28.2679373Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/assembler.cpp 2025-05-07T19:49:28.2680035Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/builder.cpp 2025-05-07T19:49:28.2680704Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/codeholder.cpp 2025-05-07T19:49:28.2681384Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/codewriter.cpp 2025-05-07T19:49:28.2682053Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/compiler.cpp 2025-05-07T19:49:28.2682741Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/constpool.cpp 2025-05-07T19:49:28.2683401Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/cpuinfo.cpp 2025-05-07T19:49:28.2684073Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/emithelper.cpp 2025-05-07T19:49:28.2684742Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/emitter.cpp 2025-05-07T19:49:28.2685785Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/emitterutils.cpp 2025-05-07T19:49:28.2686609Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/environment.cpp 2025-05-07T19:49:28.2687425Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/errorhandler.cpp 2025-05-07T19:49:28.2688235Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/formatter.cpp 2025-05-07T19:49:28.2689001Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/func.cpp 2025-05-07T19:49:28.2689804Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/funcargscontext.cpp 2025-05-07T19:49:28.2690618Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/globals.cpp 2025-05-07T19:49:28.2691365Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/inst.cpp 2025-05-07T19:49:28.2692118Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/instdb.cpp 2025-05-07T19:49:28.2692903Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/jitallocator.cpp 2025-05-07T19:49:28.2693716Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/jitruntime.cpp 2025-05-07T19:49:28.2694499Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/logger.cpp 2025-05-07T19:49:28.2695253Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/operand.cpp 2025-05-07T19:49:28.2696914Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/osutils.cpp 2025-05-07T19:49:28.2697642Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/ralocal.cpp 2025-05-07T19:49:28.2698291Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/rapass.cpp 2025-05-07T19:49:28.2699058Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/rastack.cpp 2025-05-07T19:49:28.2699711Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/string.cpp 2025-05-07T19:49:28.2700355Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/support.cpp 2025-05-07T19:49:28.2701003Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/target.cpp 2025-05-07T19:49:28.2701636Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/type.cpp 2025-05-07T19:49:28.2702289Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/virtmem.cpp 2025-05-07T19:49:28.2702941Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/zone.cpp 2025-05-07T19:49:28.2703600Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/zonehash.cpp 2025-05-07T19:49:28.2704272Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/zonelist.cpp 2025-05-07T19:49:28.2704947Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/zonestack.cpp 2025-05-07T19:49:28.2705624Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/zonetree.cpp 2025-05-07T19:49:28.2706306Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/core/zonevector.cpp 2025-05-07T19:49:28.2707035Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86assembler.cpp 2025-05-07T19:49:28.2707722Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86builder.cpp 2025-05-07T19:49:28.2708401Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86compiler.cpp 2025-05-07T19:49:28.2709095Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86emithelper.cpp 2025-05-07T19:49:28.2709793Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86formatter.cpp 2025-05-07T19:49:28.2710459Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86func.cpp 2025-05-07T19:49:28.2711124Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86instapi.cpp 2025-05-07T19:49:28.2711788Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86instdb.cpp 2025-05-07T19:49:28.2712665Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86operand.cpp 2025-05-07T19:49:28.2713327Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src/asmjit/x86/x86rapass.cpp 2025-05-07T19:49:28.2713788Z 2025-05-07T19:49:28.2713962Z CC_FLAGS: 2025-05-07T19:49:28.2714071Z 2025-05-07T19:49:28.2714138Z 2025-05-07T19:49:28.2714307Z NVCC_FLAGS: 2025-05-07T19:49:28.2714417Z 2025-05-07T19:49:28.2714483Z 2025-05-07T19:49:28.2714652Z HIPCC_FLAGS: 2025-05-07T19:49:28.2714768Z 2025-05-07T19:49:28.2714834Z 2025-05-07T19:49:28.2715000Z INCLUDE_DIRS: 2025-05-07T19:49:28.2715259Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:28.2715618Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:49:28.2715967Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include 2025-05-07T19:49:28.2716340Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:28.2716808Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include 2025-05-07T19:49:28.2717457Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include 2025-05-07T19:49:28.2718073Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src 2025-05-07T19:49:28.2718668Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include 2025-05-07T19:49:28.2719195Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include 2025-05-07T19:49:28.2719758Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include 2025-05-07T19:49:28.2720353Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include 2025-05-07T19:49:28.2720988Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include 2025-05-07T19:49:28.2721536Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include 2025-05-07T19:49:28.2721953Z 2025-05-07T19:49:28.2722141Z Selected Source Files: 2025-05-07T19:49:28.2722288Z 2025-05-07T19:49:28.2722356Z 2025-05-07T19:49:28.2722537Z HIPified Source Files: 2025-05-07T19:49:28.2722686Z 2025-05-07T19:49:28.2722753Z 2025-05-07T19:49:28.2722936Z Library Dependencies: 2025-05-07T19:49:28.2723155Z torch 2025-05-07T19:49:28.2723326Z torch_library 2025-05-07T19:49:28.2723697Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so 2025-05-07T19:49:28.2724140Z /usr/local/cuda-12.8/lib64/libnvrtc.so 2025-05-07T19:49:28.2724602Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so 2025-05-07T19:49:28.2725253Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 2025-05-07T19:49:28.2725724Z /usr/lib64/libcuda.so 2025-05-07T19:49:28.2726050Z /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:28.2726413Z 2025-05-07T19:49:28.2726588Z Output Library: 2025-05-07T19:49:28.2726777Z asmjit 2025-05-07T19:49:28.2726947Z 2025-05-07T19:49:28.2727123Z Destination Directory: 2025-05-07T19:49:28.2727342Z fbgemm_gpu 2025-05-07T19:49:28.2727555Z ================================================================================ 2025-05-07T19:49:28.2727777Z 2025-05-07T19:49:28.2727782Z 2025-05-07T19:49:28.2727786Z 2025-05-07T19:49:28.2727889Z ================================================================================ 2025-05-07T19:49:28.2728195Z GPU CPP Library Target: fbgemm (SHARED) 2025-05-07T19:49:28.2728463Z 2025-05-07T19:49:28.2728627Z CPU_SRCS: 2025-05-07T19:49:28.2728732Z 2025-05-07T19:49:28.2728796Z 2025-05-07T19:49:28.2728958Z GPU_SRCS: 2025-05-07T19:49:28.2729061Z 2025-05-07T19:49:28.2729126Z 2025-05-07T19:49:28.2729297Z CUDA_SPECIFIC_SRCS: 2025-05-07T19:49:28.2729436Z 2025-05-07T19:49:28.2729500Z 2025-05-07T19:49:28.2729674Z HIP_SPECIFIC_SRCS: 2025-05-07T19:49:28.2729805Z 2025-05-07T19:49:28.2729869Z 2025-05-07T19:49:28.2730032Z OTHER_SRCS: 2025-05-07T19:49:28.2730348Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/EmbeddingSpMDM.cc 2025-05-07T19:49:28.2730877Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/EmbeddingSpMDMAutovec.cc 2025-05-07T19:49:28.2731430Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/EmbeddingSpMDMNBit.cc 2025-05-07T19:49:28.2731937Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/QuantUtils.cc 2025-05-07T19:49:28.2732439Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/RefImplementations.cc 2025-05-07T19:49:28.2732999Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/RowWiseSparseAdagradFused.cc 2025-05-07T19:49:28.2733547Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/SparseAdagrad.cc 2025-05-07T19:49:28.2734008Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/Utils.cc 2025-05-07T19:49:28.2734487Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/EmbeddingSpMDMAvx2.cc 2025-05-07T19:49:28.2735006Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../src/QuantUtilsAvx2.cc 2025-05-07T19:49:28.2735363Z 2025-05-07T19:49:28.2735528Z CC_FLAGS: 2025-05-07T19:49:28.2735635Z 2025-05-07T19:49:28.2735704Z 2025-05-07T19:49:28.2735870Z NVCC_FLAGS: 2025-05-07T19:49:28.2735979Z 2025-05-07T19:49:28.2736043Z 2025-05-07T19:49:28.2736212Z HIPCC_FLAGS: 2025-05-07T19:49:28.2736423Z 2025-05-07T19:49:28.2736495Z 2025-05-07T19:49:28.2736672Z INCLUDE_DIRS: 2025-05-07T19:49:28.2736970Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:28.2737316Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:49:28.2737663Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include 2025-05-07T19:49:28.2738029Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:28.2738495Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include 2025-05-07T19:49:28.2739247Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include 2025-05-07T19:49:28.2739861Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src 2025-05-07T19:49:28.2740364Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include 2025-05-07T19:49:28.2740881Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include 2025-05-07T19:49:28.2741441Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include 2025-05-07T19:49:28.2742043Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include 2025-05-07T19:49:28.2742591Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include 2025-05-07T19:49:28.2743130Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include 2025-05-07T19:49:28.2743549Z 2025-05-07T19:49:28.2743726Z Selected Source Files: 2025-05-07T19:49:28.2743870Z 2025-05-07T19:49:28.2743935Z 2025-05-07T19:49:28.2744118Z HIPified Source Files: 2025-05-07T19:49:28.2744260Z 2025-05-07T19:49:28.2744325Z 2025-05-07T19:49:28.2744501Z Library Dependencies: 2025-05-07T19:49:28.2744706Z torch 2025-05-07T19:49:28.2744880Z torch_library 2025-05-07T19:49:28.2745240Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so 2025-05-07T19:49:28.2745689Z /usr/local/cuda-12.8/lib64/libnvrtc.so 2025-05-07T19:49:28.2746149Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so 2025-05-07T19:49:28.2746852Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 2025-05-07T19:49:28.2747314Z /usr/lib64/libcuda.so 2025-05-07T19:49:28.2747512Z asmjit 2025-05-07T19:49:28.2747804Z /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:28.2748149Z 2025-05-07T19:49:28.2748319Z Output Library: 2025-05-07T19:49:28.2748505Z fbgemm 2025-05-07T19:49:28.2748673Z 2025-05-07T19:49:28.2748849Z Destination Directory: 2025-05-07T19:49:28.2749066Z fbgemm_gpu 2025-05-07T19:49:28.2749282Z ================================================================================ 2025-05-07T19:49:28.2749489Z 2025-05-07T19:49:28.2749493Z 2025-05-07T19:49:28.2749497Z 2025-05-07T19:49:28.2749598Z ================================================================================ 2025-05-07T19:49:28.2749906Z Running code generation script ... 2025-05-07T19:49:28.2750624Z /__w/_temp/conda_environment_14891846312/bin/python /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_backward_split.py --opensource 2025-05-07T19:49:28.2751361Z ================================================================================ 2025-05-07T19:49:28.2751567Z 2025-05-07T19:49:28.9528125Z [ARGS PARSE] Parsed arguments: Namespace(install_dir='.', is_fbcode=False, is_rocm=False) 2025-05-07T19:49:28.9529268Z [GENERAATE BACKWARD SPLIT]: ['/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_backward_split.py', '--opensource'] 2025-05-07T19:49:28.9530217Z Written: gen_embedding_backward_dense_split_weighted_vbe_cuda.cu 2025-05-07T19:49:28.9530663Z Written: gen_embedding_backward_dense_split_weighted_cuda.cu 2025-05-07T19:49:28.9531119Z Written: gen_embedding_backward_dense_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:28.9531592Z Written: gen_embedding_backward_dense_split_unweighted_vbe_cuda.cu 2025-05-07T19:49:28.9532047Z Written: gen_embedding_backward_dense_split_unweighted_cuda.cu 2025-05-07T19:49:28.9532688Z Written: gen_embedding_backward_dense_split_weighted_vbe_meta.cpp 2025-05-07T19:49:28.9533138Z Written: gen_embedding_backward_dense_split_weighted_meta.cpp 2025-05-07T19:49:28.9533597Z Written: gen_embedding_backward_dense_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:28.9534080Z Written: gen_embedding_backward_dense_split_unweighted_vbe_meta.cpp 2025-05-07T19:49:28.9534542Z Written: gen_embedding_backward_dense_split_unweighted_meta.cpp 2025-05-07T19:49:28.9535121Z Written: gen_embedding_backward_dense_split_weighted_vbe_kernel_cta.cu 2025-05-07T19:49:28.9535609Z Written: gen_embedding_backward_dense_split_weighted_kernel_cta.cu 2025-05-07T19:49:28.9536100Z Written: gen_embedding_backward_dense_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:28.9536623Z Written: gen_embedding_backward_dense_split_unweighted_vbe_kernel_cta.cu 2025-05-07T19:49:28.9537115Z Written: gen_embedding_backward_dense_split_unweighted_kernel_cta.cu 2025-05-07T19:49:28.9537615Z Written: gen_embedding_backward_dense_split_weighted_vbe_kernel_warp.cu 2025-05-07T19:49:28.9538112Z Written: gen_embedding_backward_dense_split_weighted_kernel_warp.cu 2025-05-07T19:49:28.9538607Z Written: gen_embedding_backward_dense_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:28.9539132Z Written: gen_embedding_backward_dense_split_unweighted_vbe_kernel_warp.cu 2025-05-07T19:49:28.9539632Z Written: gen_embedding_backward_dense_split_unweighted_kernel_warp.cu 2025-05-07T19:49:28.9540109Z Written: gen_embedding_optimizer_dense_split_device_kernel.cuh 2025-05-07T19:49:28.9540501Z Written: gen_embedding_backward_split_dense.cpp 2025-05-07T19:49:28.9540860Z Written: gen_embedding_backward_dense_split_cpu.cpp 2025-05-07T19:49:28.9541262Z Written: gen_embedding_backward_adagrad_split_weighted_cuda.cu 2025-05-07T19:49:28.9541728Z Written: gen_embedding_backward_adagrad_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:28.9542210Z Written: gen_embedding_backward_adagrad_split_unweighted_cuda.cu 2025-05-07T19:49:28.9542655Z Written: gen_embedding_backward_adagrad_split_weighted_meta.cpp 2025-05-07T19:49:28.9543140Z Written: gen_embedding_backward_adagrad_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:28.9543620Z Written: gen_embedding_backward_adagrad_split_unweighted_meta.cpp 2025-05-07T19:49:28.9544099Z Written: gen_embedding_backward_adagrad_split_weighted_kernel_cta.cu 2025-05-07T19:49:28.9544606Z Written: gen_embedding_backward_adagrad_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:28.9545128Z Written: gen_embedding_backward_adagrad_split_unweighted_kernel_cta.cu 2025-05-07T19:49:28.9545621Z Written: gen_embedding_backward_adagrad_split_weighted_kernel_warp.cu 2025-05-07T19:49:28.9546129Z Written: gen_embedding_backward_adagrad_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:28.9546655Z Written: gen_embedding_backward_adagrad_split_unweighted_kernel_warp.cu 2025-05-07T19:49:28.9547127Z Written: gen_embedding_optimizer_adagrad_split_device_kernel.cuh 2025-05-07T19:49:28.9547533Z Written: gen_embedding_backward_split_adagrad.cpp 2025-05-07T19:49:28.9547910Z Written: gen_embedding_split_adagrad_pt2_autograd.cpp 2025-05-07T19:49:28.9548331Z Written: gen_embedding_backward_split_adagrad_pt2_cuda_wrapper.cpp 2025-05-07T19:49:28.9548708Z Written: lookup_adagrad.py 2025-05-07T19:49:28.9549003Z Written: gen_embedding_backward_adagrad_split_cpu.cpp 2025-05-07T19:49:28.9549383Z Written: gen_embedding_backward_split_adagrad_cpu.cpp 2025-05-07T19:49:28.9549813Z Written: gen_embedding_backward_split_adagrad_pt2_cpu_wrapper.cpp 2025-05-07T19:49:28.9550266Z Written: gen_embedding_backward_adam_split_weighted_vbe_cuda.cu 2025-05-07T19:49:28.9550702Z Written: gen_embedding_backward_adam_split_weighted_cuda.cu 2025-05-07T19:49:28.9551141Z Written: gen_embedding_backward_adam_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:28.9551607Z Written: gen_embedding_backward_adam_split_unweighted_vbe_cuda.cu 2025-05-07T19:49:28.9552156Z Written: gen_embedding_backward_adam_split_unweighted_cuda.cu 2025-05-07T19:49:28.9552594Z Written: gen_embedding_backward_adam_split_weighted_vbe_meta.cpp 2025-05-07T19:49:28.9553123Z Written: gen_embedding_backward_adam_split_weighted_meta.cpp 2025-05-07T19:49:28.9560957Z Written: gen_embedding_backward_adam_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:28.9561439Z Written: gen_embedding_backward_adam_split_unweighted_vbe_meta.cpp 2025-05-07T19:49:28.9561904Z Written: gen_embedding_backward_adam_split_unweighted_meta.cpp 2025-05-07T19:49:28.9562377Z Written: gen_embedding_backward_adam_split_weighted_vbe_kernel_cta.cu 2025-05-07T19:49:28.9562963Z Written: gen_embedding_backward_adam_split_weighted_kernel_cta.cu 2025-05-07T19:49:28.9563464Z Written: gen_embedding_backward_adam_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:28.9563974Z Written: gen_embedding_backward_adam_split_unweighted_vbe_kernel_cta.cu 2025-05-07T19:49:28.9564470Z Written: gen_embedding_backward_adam_split_unweighted_kernel_cta.cu 2025-05-07T19:49:28.9564980Z Written: gen_embedding_backward_adam_split_weighted_vbe_kernel_warp.cu 2025-05-07T19:49:28.9565465Z Written: gen_embedding_backward_adam_split_weighted_kernel_warp.cu 2025-05-07T19:49:28.9565961Z Written: gen_embedding_backward_adam_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:28.9566477Z Written: gen_embedding_backward_adam_split_unweighted_vbe_kernel_warp.cu 2025-05-07T19:49:28.9566979Z Written: gen_embedding_backward_adam_split_unweighted_kernel_warp.cu 2025-05-07T19:49:28.9567446Z Written: gen_embedding_optimizer_adam_split_device_kernel.cuh 2025-05-07T19:49:28.9567839Z Written: gen_embedding_backward_split_adam.cpp 2025-05-07T19:49:28.9568193Z Written: gen_embedding_split_adam_pt2_autograd.cpp 2025-05-07T19:49:28.9568601Z Written: gen_embedding_backward_split_adam_pt2_cuda_wrapper.cpp 2025-05-07T19:49:28.9568965Z Written: lookup_adam.py 2025-05-07T19:49:28.9569238Z Written: gen_embedding_backward_split_adam_cpu.cpp 2025-05-07T19:49:28.9569641Z Written: gen_embedding_backward_split_adam_pt2_cpu_wrapper.cpp 2025-05-07T19:49:28.9570066Z Written: gen_embedding_backward_lamb_split_weighted_cuda.cu 2025-05-07T19:49:28.9570520Z Written: gen_embedding_backward_lamb_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:28.9570978Z Written: gen_embedding_backward_lamb_split_unweighted_cuda.cu 2025-05-07T19:49:28.9571406Z Written: gen_embedding_backward_lamb_split_weighted_meta.cpp 2025-05-07T19:49:28.9571865Z Written: gen_embedding_backward_lamb_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:28.9572323Z Written: gen_embedding_backward_lamb_split_unweighted_meta.cpp 2025-05-07T19:49:28.9572785Z Written: gen_embedding_backward_lamb_split_weighted_kernel_cta.cu 2025-05-07T19:49:28.9573269Z Written: gen_embedding_backward_lamb_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:28.9573771Z Written: gen_embedding_backward_lamb_split_unweighted_kernel_cta.cu 2025-05-07T19:49:28.9574238Z Written: gen_embedding_backward_lamb_split_weighted_kernel_warp.cu 2025-05-07T19:49:28.9574731Z Written: gen_embedding_backward_lamb_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:28.9575242Z Written: gen_embedding_backward_lamb_split_unweighted_kernel_warp.cu 2025-05-07T19:49:28.9575704Z Written: gen_embedding_optimizer_lamb_split_device_kernel.cuh 2025-05-07T19:49:28.9576096Z Written: gen_embedding_backward_split_lamb.cpp 2025-05-07T19:49:28.9576439Z Written: gen_embedding_split_lamb_pt2_autograd.cpp 2025-05-07T19:49:28.9576847Z Written: gen_embedding_backward_split_lamb_pt2_cuda_wrapper.cpp 2025-05-07T19:49:28.9577216Z Written: lookup_lamb.py 2025-05-07T19:49:28.9577487Z Written: gen_embedding_backward_split_lamb_cpu.cpp 2025-05-07T19:49:28.9577894Z Written: gen_embedding_backward_split_lamb_pt2_cpu_wrapper.cpp 2025-05-07T19:49:28.9578330Z Written: gen_embedding_backward_lars_sgd_split_weighted_cuda.cu 2025-05-07T19:49:28.9578803Z Written: gen_embedding_backward_lars_sgd_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:28.9579280Z Written: gen_embedding_backward_lars_sgd_split_unweighted_cuda.cu 2025-05-07T19:49:28.9579735Z Written: gen_embedding_backward_lars_sgd_split_weighted_meta.cpp 2025-05-07T19:49:28.9580340Z Written: gen_embedding_backward_lars_sgd_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:28.9580832Z Written: gen_embedding_backward_lars_sgd_split_unweighted_meta.cpp 2025-05-07T19:49:28.9581312Z Written: gen_embedding_backward_lars_sgd_split_weighted_kernel_cta.cu 2025-05-07T19:49:28.9581829Z Written: gen_embedding_backward_lars_sgd_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:28.9582357Z Written: gen_embedding_backward_lars_sgd_split_unweighted_kernel_cta.cu 2025-05-07T19:49:28.9582986Z Written: gen_embedding_backward_lars_sgd_split_weighted_kernel_warp.cu 2025-05-07T19:49:28.9583517Z Written: gen_embedding_backward_lars_sgd_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:28.9584059Z Written: gen_embedding_backward_lars_sgd_split_unweighted_kernel_warp.cu 2025-05-07T19:49:28.9584541Z Written: gen_embedding_optimizer_lars_sgd_split_device_kernel.cuh 2025-05-07T19:49:28.9585137Z Written: gen_embedding_backward_split_lars_sgd.cpp 2025-05-07T19:49:28.9585508Z Written: gen_embedding_split_lars_sgd_pt2_autograd.cpp 2025-05-07T19:49:28.9585944Z Written: gen_embedding_backward_split_lars_sgd_pt2_cuda_wrapper.cpp 2025-05-07T19:49:28.9586319Z Written: lookup_lars_sgd.py 2025-05-07T19:49:28.9586625Z Written: gen_embedding_backward_split_lars_sgd_cpu.cpp 2025-05-07T19:49:28.9587095Z Written: gen_embedding_backward_split_lars_sgd_pt2_cpu_wrapper.cpp 2025-05-07T19:49:28.9587593Z Written: gen_embedding_backward_partial_rowwise_adam_split_weighted_cuda.cu 2025-05-07T19:49:28.9588161Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:28.9588725Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_cuda.cu 2025-05-07T19:49:28.9589270Z Written: gen_embedding_backward_partial_rowwise_adam_split_weighted_meta.cpp 2025-05-07T19:49:28.9589830Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:28.9590420Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_meta.cpp 2025-05-07T19:49:28.9590988Z Written: gen_embedding_backward_partial_rowwise_adam_split_weighted_kernel_cta.cu 2025-05-07T19:49:28.9591598Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:28.9592265Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_kernel_cta.cu 2025-05-07T19:49:28.9592844Z Written: gen_embedding_backward_partial_rowwise_adam_split_weighted_kernel_warp.cu 2025-05-07T19:49:28.9593460Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:28.9594075Z Written: gen_embedding_backward_partial_rowwise_adam_split_unweighted_kernel_warp.cu 2025-05-07T19:49:29.0792284Z Written: gen_embedding_optimizer_partial_rowwise_adam_split_device_kernel.cuh 2025-05-07T19:49:29.0793028Z Written: gen_embedding_backward_split_partial_rowwise_adam.cpp 2025-05-07T19:49:29.0793642Z Written: gen_embedding_split_partial_rowwise_adam_pt2_autograd.cpp 2025-05-07T19:49:29.0794316Z Written: gen_embedding_backward_split_partial_rowwise_adam_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.0794771Z Written: lookup_partial_rowwise_adam.py 2025-05-07T19:49:29.0795154Z Written: gen_embedding_backward_split_partial_rowwise_adam_cpu.cpp 2025-05-07T19:49:29.0795666Z Written: gen_embedding_backward_split_partial_rowwise_adam_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.0796204Z Written: gen_embedding_backward_partial_rowwise_lamb_split_weighted_cuda.cu 2025-05-07T19:49:29.0796771Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:29.0797334Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_cuda.cu 2025-05-07T19:49:29.0797880Z Written: gen_embedding_backward_partial_rowwise_lamb_split_weighted_meta.cpp 2025-05-07T19:49:29.0798449Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:29.0799021Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_meta.cpp 2025-05-07T19:49:29.0799596Z Written: gen_embedding_backward_partial_rowwise_lamb_split_weighted_kernel_cta.cu 2025-05-07T19:49:29.0800382Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:29.0800996Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_kernel_cta.cu 2025-05-07T19:49:29.0801572Z Written: gen_embedding_backward_partial_rowwise_lamb_split_weighted_kernel_warp.cu 2025-05-07T19:49:29.0802296Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:29.0802915Z Written: gen_embedding_backward_partial_rowwise_lamb_split_unweighted_kernel_warp.cu 2025-05-07T19:49:29.0803478Z Written: gen_embedding_optimizer_partial_rowwise_lamb_split_device_kernel.cuh 2025-05-07T19:49:29.0803978Z Written: gen_embedding_backward_split_partial_rowwise_lamb.cpp 2025-05-07T19:49:29.0804426Z Written: gen_embedding_split_partial_rowwise_lamb_pt2_autograd.cpp 2025-05-07T19:49:29.0804936Z Written: gen_embedding_backward_split_partial_rowwise_lamb_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.0805384Z Written: lookup_partial_rowwise_lamb.py 2025-05-07T19:49:29.0805759Z Written: gen_embedding_backward_split_partial_rowwise_lamb_cpu.cpp 2025-05-07T19:49:29.0806265Z Written: gen_embedding_backward_split_partial_rowwise_lamb_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.0806791Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_vbe_cuda.cu 2025-05-07T19:49:29.0807305Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_cuda.cu 2025-05-07T19:49:29.0807806Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_cuda.cu 2025-05-07T19:49:29.0808293Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_cuda.cu 2025-05-07T19:49:29.0808808Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_nobag_cuda.cu 2025-05-07T19:49:29.0809344Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:29.0809878Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_vbe_cuda.cu 2025-05-07T19:49:29.0810401Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_cuda.cu 2025-05-07T19:49:29.0810918Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_cuda.cu 2025-05-07T19:49:29.0811411Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_cuda.cu 2025-05-07T19:49:29.0811927Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_vbe_meta.cpp 2025-05-07T19:49:29.0812448Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_meta.cpp 2025-05-07T19:49:29.0812958Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_meta.cpp 2025-05-07T19:49:29.0813452Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_meta.cpp 2025-05-07T19:49:29.0813967Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_nobag_meta.cpp 2025-05-07T19:49:29.0814516Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:29.0815051Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_vbe_meta.cpp 2025-05-07T19:49:29.0815589Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_meta.cpp 2025-05-07T19:49:29.0816125Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_meta.cpp 2025-05-07T19:49:29.0816668Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_meta.cpp 2025-05-07T19:49:29.0817199Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.0817751Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.0818292Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_kernel_cta.cu 2025-05-07T19:49:29.0818811Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_kernel_cta.cu 2025-05-07T19:49:29.0819366Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:29.0819952Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:29.0820527Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.0821199Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.0821751Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_kernel_cta.cu 2025-05-07T19:49:29.0822291Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_kernel_cta.cu 2025-05-07T19:49:29.0822842Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.0823476Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.0824028Z Written: gen_embedding_backward_rowwise_adagrad_ssd_weighted_kernel_warp.cu 2025-05-07T19:49:29.0824558Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_kernel_warp.cu 2025-05-07T19:49:29.0825122Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:29.0825706Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:29.0826321Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.0826898Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.0827456Z Written: gen_embedding_backward_rowwise_adagrad_ssd_unweighted_kernel_warp.cu 2025-05-07T19:49:29.0828012Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_kernel_warp.cu 2025-05-07T19:49:29.0828597Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_gwd_kernel_cta.cu 2025-05-07T19:49:29.0829172Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_gwd_kernel_cta.cu 2025-05-07T19:49:29.0829758Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_gwd_kernel_cta.cu 2025-05-07T19:49:29.0830346Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_gwd_kernel_cta.cu 2025-05-07T19:49:29.0830932Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_gwd_kernel_warp.cu 2025-05-07T19:49:29.0831519Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_gwd_kernel_warp.cu 2025-05-07T19:49:29.0832198Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_gwd_kernel_warp.cu 2025-05-07T19:49:29.0832796Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_gwd_kernel_warp.cu 2025-05-07T19:49:29.0833353Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_vbe_gwd_cuda.cu 2025-05-07T19:49:29.0833892Z Written: gen_embedding_backward_rowwise_adagrad_split_weighted_gwd_cuda.cu 2025-05-07T19:49:29.0834423Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_vbe_gwd_cuda.cu 2025-05-07T19:49:29.0834976Z Written: gen_embedding_backward_rowwise_adagrad_split_unweighted_gwd_cuda.cu 2025-05-07T19:49:29.0835488Z Written: gen_embedding_optimizer_rowwise_adagrad_ssd_device_kernel.cuh 2025-05-07T19:49:29.0835984Z Written: gen_embedding_optimizer_rowwise_adagrad_split_device_kernel.cuh 2025-05-07T19:49:29.0836435Z Written: gen_embedding_backward_ssd_rowwise_adagrad.cpp 2025-05-07T19:49:29.0836844Z Written: gen_embedding_ssd_rowwise_adagrad_pt2_autograd.cpp 2025-05-07T19:49:29.0837348Z Written: gen_embedding_backward_ssd_rowwise_adagrad_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.0837756Z Written: lookup_rowwise_adagrad_ssd.py 2025-05-07T19:49:29.0838108Z Written: gen_embedding_backward_split_rowwise_adagrad.cpp 2025-05-07T19:49:29.0838519Z Written: gen_embedding_split_rowwise_adagrad_pt2_autograd.cpp 2025-05-07T19:49:29.0838998Z Written: gen_embedding_backward_split_rowwise_adagrad_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.0839411Z Written: lookup_rowwise_adagrad.py 2025-05-07T19:49:29.0839755Z Written: gen_embedding_backward_rowwise_adagrad_split_cpu.cpp 2025-05-07T19:49:29.0840187Z Written: gen_embedding_backward_split_rowwise_adagrad_cpu.cpp 2025-05-07T19:49:29.0840651Z Written: gen_embedding_backward_split_rowwise_adagrad_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.0841190Z Written: gen_embedding_optimizer_approx_rowwise_adagrad_split_device_kernel.cuh 2025-05-07T19:49:29.0841786Z Written: gen_embedding_backward_split_approx_rowwise_adagrad.cpp 2025-05-07T19:49:29.0842254Z Written: gen_embedding_split_approx_rowwise_adagrad_pt2_autograd.cpp 2025-05-07T19:49:29.0842773Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.0843296Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_cpu.cpp 2025-05-07T19:49:29.0843814Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.0844511Z Written: gen_embedding_optimizer_rowwise_adagrad_with_weight_decay_split_device_kernel.cuh 2025-05-07T19:49:29.0845098Z Written: gen_embedding_backward_split_rowwise_adagrad_with_weight_decay.cpp 2025-05-07T19:49:29.0845634Z Written: gen_embedding_split_rowwise_adagrad_with_weight_decay_pt2_autograd.cpp 2025-05-07T19:49:29.0846245Z Written: gen_embedding_backward_split_rowwise_adagrad_with_weight_decay_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.0846882Z Written: gen_embedding_backward_split_rowwise_adagrad_with_weight_decay_cpu.cpp 2025-05-07T19:49:29.0847477Z Written: gen_embedding_backward_split_rowwise_adagrad_with_weight_decay_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.0848141Z Written: gen_embedding_optimizer_approx_rowwise_adagrad_with_weight_decay_split_device_kernel.cuh 2025-05-07T19:49:29.0848767Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_weight_decay.cpp 2025-05-07T19:49:29.0849360Z Written: gen_embedding_split_approx_rowwise_adagrad_with_weight_decay_pt2_autograd.cpp 2025-05-07T19:49:29.0850009Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_weight_decay_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2234403Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_weight_decay_cpu.cpp 2025-05-07T19:49:29.2235343Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_weight_decay_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2236260Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_vbe_cuda.cu 2025-05-07T19:49:29.2236933Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_cuda.cu 2025-05-07T19:49:29.2237544Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:29.2238172Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_vbe_cuda.cu 2025-05-07T19:49:29.2238781Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_cuda.cu 2025-05-07T19:49:29.2239391Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_vbe_meta.cpp 2025-05-07T19:49:29.2240004Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_meta.cpp 2025-05-07T19:49:29.2240623Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:29.2241265Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_vbe_meta.cpp 2025-05-07T19:49:29.2241888Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_meta.cpp 2025-05-07T19:49:29.2242524Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.2243173Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_kernel_cta.cu 2025-05-07T19:49:29.2243827Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:29.2244509Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.2245166Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_kernel_cta.cu 2025-05-07T19:49:29.2245812Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.2246518Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_weighted_kernel_warp.cu 2025-05-07T19:49:29.2247181Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:29.2248050Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.2248715Z Written: gen_embedding_backward_rowwise_adagrad_with_counter_split_unweighted_kernel_warp.cu 2025-05-07T19:49:29.2249335Z Written: gen_embedding_optimizer_rowwise_adagrad_with_counter_split_device_kernel.cuh 2025-05-07T19:49:29.2249890Z Written: gen_embedding_backward_split_rowwise_adagrad_with_counter.cpp 2025-05-07T19:49:29.2250506Z Written: gen_embedding_split_rowwise_adagrad_with_counter_pt2_autograd.cpp 2025-05-07T19:49:29.2251077Z Written: gen_embedding_backward_split_rowwise_adagrad_with_counter_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2251568Z Written: lookup_rowwise_adagrad_with_counter.py 2025-05-07T19:49:29.2251997Z Written: gen_embedding_backward_split_rowwise_adagrad_with_counter_cpu.cpp 2025-05-07T19:49:29.2252563Z Written: gen_embedding_backward_split_rowwise_adagrad_with_counter_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2253191Z Written: gen_embedding_optimizer_approx_rowwise_adagrad_with_counter_split_device_kernel.cuh 2025-05-07T19:49:29.2253797Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_counter.cpp 2025-05-07T19:49:29.2254351Z Written: gen_embedding_split_approx_rowwise_adagrad_with_counter_pt2_autograd.cpp 2025-05-07T19:49:29.2254968Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_counter_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2255594Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_counter_cpu.cpp 2025-05-07T19:49:29.2256207Z Written: gen_embedding_backward_split_approx_rowwise_adagrad_with_counter_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2256876Z Written: gen_embedding_optimizer_rowwise_weighted_adagrad_split_device_kernel.cuh 2025-05-07T19:49:29.2257398Z Written: gen_embedding_backward_split_rowwise_weighted_adagrad.cpp 2025-05-07T19:49:29.2257883Z Written: gen_embedding_split_rowwise_weighted_adagrad_pt2_autograd.cpp 2025-05-07T19:49:29.2258424Z Written: gen_embedding_backward_split_rowwise_weighted_adagrad_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2258970Z Written: gen_embedding_backward_split_rowwise_weighted_adagrad_cpu.cpp 2025-05-07T19:49:29.2259509Z Written: gen_embedding_backward_split_rowwise_weighted_adagrad_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2260018Z Written: gen_embedding_backward_sgd_split_weighted_vbe_cuda.cu 2025-05-07T19:49:29.2260447Z Written: gen_embedding_backward_sgd_split_weighted_cuda.cu 2025-05-07T19:49:29.2260886Z Written: gen_embedding_backward_sgd_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:29.2261346Z Written: gen_embedding_backward_sgd_split_unweighted_vbe_cuda.cu 2025-05-07T19:49:29.2261787Z Written: gen_embedding_backward_sgd_split_unweighted_cuda.cu 2025-05-07T19:49:29.2262216Z Written: gen_embedding_backward_sgd_split_weighted_vbe_meta.cpp 2025-05-07T19:49:29.2262652Z Written: gen_embedding_backward_sgd_split_weighted_meta.cpp 2025-05-07T19:49:29.2263095Z Written: gen_embedding_backward_sgd_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:29.2263569Z Written: gen_embedding_backward_sgd_split_unweighted_vbe_meta.cpp 2025-05-07T19:49:29.2264014Z Written: gen_embedding_backward_sgd_split_unweighted_meta.cpp 2025-05-07T19:49:29.2264476Z Written: gen_embedding_backward_sgd_split_weighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.2264945Z Written: gen_embedding_backward_sgd_split_weighted_kernel_cta.cu 2025-05-07T19:49:29.2265424Z Written: gen_embedding_backward_sgd_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:29.2265938Z Written: gen_embedding_backward_sgd_split_unweighted_vbe_kernel_cta.cu 2025-05-07T19:49:29.2266416Z Written: gen_embedding_backward_sgd_split_unweighted_kernel_cta.cu 2025-05-07T19:49:29.2266946Z Written: gen_embedding_backward_sgd_split_weighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.2267431Z Written: gen_embedding_backward_sgd_split_weighted_kernel_warp.cu 2025-05-07T19:49:29.2267945Z Written: gen_embedding_backward_sgd_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:29.2268455Z Written: gen_embedding_backward_sgd_split_unweighted_vbe_kernel_warp.cu 2025-05-07T19:49:29.2269036Z Written: gen_embedding_backward_sgd_split_unweighted_kernel_warp.cu 2025-05-07T19:49:29.2269482Z Written: gen_embedding_optimizer_sgd_split_device_kernel.cuh 2025-05-07T19:49:29.2269866Z Written: gen_embedding_backward_split_sgd.cpp 2025-05-07T19:49:29.2270206Z Written: gen_embedding_split_sgd_pt2_autograd.cpp 2025-05-07T19:49:29.2270607Z Written: gen_embedding_backward_split_sgd_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2270959Z Written: lookup_sgd.py 2025-05-07T19:49:29.2271316Z Written: gen_embedding_backward_sgd_split_cpu.cpp 2025-05-07T19:49:29.2271671Z Written: gen_embedding_backward_split_sgd_cpu.cpp 2025-05-07T19:49:29.2272169Z Written: gen_embedding_backward_split_sgd_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2272629Z Written: gen_embedding_optimizer_approx_sgd_split_device_kernel.cuh 2025-05-07T19:49:29.2273050Z Written: gen_embedding_backward_split_approx_sgd.cpp 2025-05-07T19:49:29.2273437Z Written: gen_embedding_split_approx_sgd_pt2_autograd.cpp 2025-05-07T19:49:29.2273882Z Written: gen_embedding_backward_split_approx_sgd_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2274331Z Written: gen_embedding_backward_split_approx_sgd_cpu.cpp 2025-05-07T19:49:29.2274770Z Written: gen_embedding_backward_split_approx_sgd_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2275220Z Written: gen_embedding_backward_none_split_weighted_cuda.cu 2025-05-07T19:49:29.2275664Z Written: gen_embedding_backward_none_split_unweighted_nobag_cuda.cu 2025-05-07T19:49:29.2276121Z Written: gen_embedding_backward_none_split_unweighted_cuda.cu 2025-05-07T19:49:29.2276557Z Written: gen_embedding_backward_none_split_weighted_meta.cpp 2025-05-07T19:49:29.2277057Z Written: gen_embedding_backward_none_split_unweighted_nobag_meta.cpp 2025-05-07T19:49:29.2277519Z Written: gen_embedding_backward_none_split_unweighted_meta.cpp 2025-05-07T19:49:29.2277973Z Written: gen_embedding_backward_none_split_weighted_kernel_cta.cu 2025-05-07T19:49:29.2278460Z Written: gen_embedding_backward_none_split_unweighted_nobag_kernel_cta.cu 2025-05-07T19:49:29.2278970Z Written: gen_embedding_backward_none_split_unweighted_kernel_cta.cu 2025-05-07T19:49:29.2279437Z Written: gen_embedding_backward_none_split_weighted_kernel_warp.cu 2025-05-07T19:49:29.2279932Z Written: gen_embedding_backward_none_split_unweighted_nobag_kernel_warp.cu 2025-05-07T19:49:29.2280435Z Written: gen_embedding_backward_none_split_unweighted_kernel_warp.cu 2025-05-07T19:49:29.2280893Z Written: gen_embedding_optimizer_none_split_device_kernel.cuh 2025-05-07T19:49:29.2281289Z Written: gen_embedding_backward_split_none.cpp 2025-05-07T19:49:29.2281634Z Written: gen_embedding_split_none_pt2_autograd.cpp 2025-05-07T19:49:29.2282038Z Written: gen_embedding_backward_split_none_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.2282392Z Written: lookup_none.py 2025-05-07T19:49:29.2282672Z Written: gen_embedding_backward_split_none_cpu.cpp 2025-05-07T19:49:29.2283068Z Written: gen_embedding_backward_split_none_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.2283532Z Written: gen_embedding_backward_split_weighted_device_kernel_hip.hip 2025-05-07T19:49:29.2284041Z Written: gen_embedding_backward_split_unweighted_nobag_device_kernel_hip.hip 2025-05-07T19:49:29.2284568Z Written: gen_embedding_backward_split_unweighted_device_kernel_hip.hip 2025-05-07T19:49:29.2285221Z Written: gen_embedding_backward_ssd_weighted_vbe_device_kernel.cuh 2025-05-07T19:49:29.2285687Z Written: gen_embedding_backward_split_weighted_vbe_device_kernel.cuh 2025-05-07T19:49:29.2286154Z Written: gen_embedding_backward_ssd_weighted_device_kernel.cuh 2025-05-07T19:49:29.2286593Z Written: gen_embedding_backward_split_weighted_device_kernel.cuh 2025-05-07T19:49:29.2287125Z Written: gen_embedding_backward_ssd_unweighted_nobag_device_kernel.cuh 2025-05-07T19:49:29.2287623Z Written: gen_embedding_backward_split_unweighted_nobag_device_kernel.cuh 2025-05-07T19:49:29.2288121Z Written: gen_embedding_backward_ssd_unweighted_vbe_device_kernel.cuh 2025-05-07T19:49:29.2288613Z Written: gen_embedding_backward_split_unweighted_vbe_device_kernel.cuh 2025-05-07T19:49:29.2289218Z Written: gen_embedding_backward_ssd_unweighted_device_kernel.cuh 2025-05-07T19:49:29.2289679Z Written: gen_embedding_backward_split_unweighted_device_kernel.cuh 2025-05-07T19:49:29.2290127Z Written: gen_embedding_backward_split_common_device_kernel.cuh 2025-05-07T19:49:29.2290560Z Written: gen_embedding_backward_split_grad_embedding_ops.cu 2025-05-07T19:49:29.2291010Z Written: gen_embedding_backward_dense_indice_weights_codegen_cuda.cu 2025-05-07T19:49:29.2291600Z Written: gen_embedding_backward_ssd_indice_weights_codegen_cuda.cu 2025-05-07T19:49:29.2292077Z Written: gen_embedding_backward_split_indice_weights_codegen_cuda.cu 2025-05-07T19:49:29.2292453Z Written: pt2_arg_utils.h 2025-05-07T19:49:29.2292690Z Written: __init__.py 2025-05-07T19:49:29.2292915Z Written: lookup_args_ssd.py 2025-05-07T19:49:29.2293162Z Written: lookup_args.py 2025-05-07T19:49:29.2332605Z 2025-05-07T19:49:29.2332772Z 2025-05-07T19:49:29.2332956Z ================================================================================ 2025-05-07T19:49:29.2333410Z Running code generation script ... 2025-05-07T19:49:29.2334330Z /__w/_temp/conda_environment_14891846312/bin/python /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_embedding_optimizer.py --opensource 2025-05-07T19:49:29.2335084Z ================================================================================ 2025-05-07T19:49:29.2335299Z 2025-05-07T19:49:29.3477732Z [ARGS PARSE] Parsed arguments: Namespace(install_dir='.', is_fbcode=False, is_rocm=False) 2025-05-07T19:49:29.3478882Z [GENERATE OPTIMIZERS]: ['/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_embedding_optimizer.py', '--opensource'] 2025-05-07T19:49:29.3479853Z Written: gen_embedding_optimizer_rowwise_adagrad_split_cuda.cu 2025-05-07T19:49:29.3480343Z Written: gen_embedding_optimizer_rowwise_adagrad_split_kernel.cu 2025-05-07T19:49:29.3480780Z Written: gen_embedding_optimizer_rowwise_adagrad_split.cpp 2025-05-07T19:49:29.3481250Z Written: gen_embedding_optimizer_rowwise_adagrad_split_device_kernel.cuh 2025-05-07T19:49:29.3481703Z Written: split_embedding_optimizer_rowwise_adagrad.py 2025-05-07T19:49:29.3482032Z Written: optimizer_args.py 2025-05-07T19:49:29.3557075Z 2025-05-07T19:49:29.3557286Z 2025-05-07T19:49:29.3557471Z ================================================================================ 2025-05-07T19:49:29.3557888Z Running code generation script ... 2025-05-07T19:49:29.3558864Z /__w/_temp/conda_environment_14891846312/bin/python /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_forward_quantized.py --opensource 2025-05-07T19:49:29.3559608Z ================================================================================ 2025-05-07T19:49:29.3559815Z 2025-05-07T19:49:29.4926166Z [ARGS PARSE] Parsed arguments: Namespace(install_dir='.', is_fbcode=False, is_rocm=False) 2025-05-07T19:49:29.4927401Z [GENERATE FORWARD QUANTIZED]: ['/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_forward_quantized.py', '--opensource'] 2025-05-07T19:49:29.4928258Z Written: gen_embedding_forward_quantized_split_nbit_kernel_weighted_fp32_codegen_cuda.cu 2025-05-07T19:49:29.4928878Z Written: gen_embedding_forward_quantized_split_nbit_kernel_weighted_fp16_codegen_cuda.cu 2025-05-07T19:49:29.4929494Z Written: gen_embedding_forward_quantized_split_nbit_kernel_weighted_fp8_codegen_cuda.cu 2025-05-07T19:49:29.4930105Z Written: gen_embedding_forward_quantized_split_nbit_kernel_weighted_int8_codegen_cuda.cu 2025-05-07T19:49:29.4930727Z Written: gen_embedding_forward_quantized_split_nbit_kernel_weighted_int4_codegen_cuda.cu 2025-05-07T19:49:29.4931336Z Written: gen_embedding_forward_quantized_split_nbit_kernel_weighted_int2_codegen_cuda.cu 2025-05-07T19:49:29.4931982Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_nobag_fp32_codegen_cuda.cu 2025-05-07T19:49:29.4932660Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_nobag_fp16_codegen_cuda.cu 2025-05-07T19:49:29.4933520Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_nobag_fp8_codegen_cuda.cu 2025-05-07T19:49:29.4934190Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_nobag_int8_codegen_cuda.cu 2025-05-07T19:49:29.4934856Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_nobag_int4_codegen_cuda.cu 2025-05-07T19:49:29.4935527Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_nobag_int2_codegen_cuda.cu 2025-05-07T19:49:29.4936313Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_fp32_codegen_cuda.cu 2025-05-07T19:49:29.4936947Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_fp16_codegen_cuda.cu 2025-05-07T19:49:29.4937583Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_fp8_codegen_cuda.cu 2025-05-07T19:49:29.4938204Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_int8_codegen_cuda.cu 2025-05-07T19:49:29.4938833Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_int4_codegen_cuda.cu 2025-05-07T19:49:29.4939466Z Written: gen_embedding_forward_quantized_split_nbit_kernel_unweighted_int2_codegen_cuda.cu 2025-05-07T19:49:29.4940061Z Written: gen_embedding_forward_quantized_split_nbit_host_weighted_codegen_cuda.cu 2025-05-07T19:49:29.4940658Z Written: gen_embedding_forward_quantized_split_nbit_host_unweighted_nobag_codegen_cuda.cu 2025-05-07T19:49:29.4941260Z Written: gen_embedding_forward_quantized_split_nbit_host_unweighted_codegen_cuda.cu 2025-05-07T19:49:29.4941797Z Written: gen_embedding_forward_quantized_weighted_codegen_cpu.cpp 2025-05-07T19:49:29.4942261Z Written: gen_embedding_forward_quantized_unweighted_codegen_cpu.cpp 2025-05-07T19:49:29.5007543Z 2025-05-07T19:49:29.5007795Z 2025-05-07T19:49:29.5008007Z ================================================================================ 2025-05-07T19:49:29.5008450Z Running code generation script ... 2025-05-07T19:49:29.5009254Z /__w/_temp/conda_environment_14891846312/bin/python /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_forward_split.py --opensource 2025-05-07T19:49:29.5010001Z ================================================================================ 2025-05-07T19:49:29.5010209Z 2025-05-07T19:49:29.9388383Z [ARGS PARSE] Parsed arguments: Namespace(install_dir='.', is_fbcode=False, is_rocm=False) 2025-05-07T19:49:29.9389535Z [GENERATE FORWARD SPLIT]: ['/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_forward_split.py', '--opensource'] 2025-05-07T19:49:29.9390250Z Written: gen_embedding_forward_dense_weighted_vbe_codegen_cuda.cu 2025-05-07T19:49:29.9390685Z Written: gen_embedding_forward_dense_weighted_codegen_cuda.cu 2025-05-07T19:49:29.9391113Z Written: gen_embedding_forward_dense_unweighted_vbe_codegen_cuda.cu 2025-05-07T19:49:29.9391553Z Written: gen_embedding_forward_dense_unweighted_codegen_cuda.cu 2025-05-07T19:49:29.9392082Z Written: gen_embedding_forward_ssd_weighted_vbe_codegen_cuda.cu 2025-05-07T19:49:29.9392511Z Written: gen_embedding_forward_split_weighted_vbe_codegen_cuda.cu 2025-05-07T19:49:29.9392926Z Written: gen_embedding_forward_ssd_weighted_codegen_cuda.cu 2025-05-07T19:49:29.9393324Z Written: gen_embedding_forward_split_weighted_codegen_cuda.cu 2025-05-07T19:49:29.9393744Z Written: gen_embedding_forward_ssd_unweighted_vbe_codegen_cuda.cu 2025-05-07T19:49:29.9394177Z Written: gen_embedding_forward_split_unweighted_vbe_codegen_cuda.cu 2025-05-07T19:49:29.9394607Z Written: gen_embedding_forward_ssd_unweighted_codegen_cuda.cu 2025-05-07T19:49:29.9395023Z Written: gen_embedding_forward_split_unweighted_codegen_cuda.cu 2025-05-07T19:49:29.9395471Z Written: gen_embedding_forward_split_weighted_vbe_gwd_codegen_cuda.cu 2025-05-07T19:49:29.9395937Z Written: gen_embedding_forward_split_weighted_gwd_codegen_cuda.cu 2025-05-07T19:49:29.9396406Z Written: gen_embedding_forward_split_unweighted_vbe_gwd_codegen_cuda.cu 2025-05-07T19:49:29.9396892Z Written: gen_embedding_forward_split_unweighted_gwd_codegen_cuda.cu 2025-05-07T19:49:29.9397541Z Written: gen_embedding_forward_dense_weighted_vbe_codegen_meta.cpp 2025-05-07T19:49:29.9397992Z Written: gen_embedding_forward_dense_weighted_codegen_meta.cpp 2025-05-07T19:49:29.9398445Z Written: gen_embedding_forward_dense_unweighted_vbe_codegen_meta.cpp 2025-05-07T19:49:29.9398913Z Written: gen_embedding_forward_dense_unweighted_codegen_meta.cpp 2025-05-07T19:49:29.9399356Z Written: gen_embedding_forward_ssd_weighted_vbe_codegen_meta.cpp 2025-05-07T19:49:29.9400035Z Written: gen_embedding_forward_split_weighted_vbe_codegen_meta.cpp 2025-05-07T19:49:29.9400480Z Written: gen_embedding_forward_ssd_weighted_codegen_meta.cpp 2025-05-07T19:49:29.9400902Z Written: gen_embedding_forward_split_weighted_codegen_meta.cpp 2025-05-07T19:49:29.9401346Z Written: gen_embedding_forward_ssd_unweighted_vbe_codegen_meta.cpp 2025-05-07T19:49:29.9401808Z Written: gen_embedding_forward_split_unweighted_vbe_codegen_meta.cpp 2025-05-07T19:49:29.9402263Z Written: gen_embedding_forward_ssd_unweighted_codegen_meta.cpp 2025-05-07T19:49:29.9402713Z Written: gen_embedding_forward_split_unweighted_codegen_meta.cpp 2025-05-07T19:49:29.9403150Z Written: gen_embedding_forward_dense_weighted_vbe_kernel.cu 2025-05-07T19:49:29.9403550Z Written: gen_embedding_forward_dense_weighted_kernel.cu 2025-05-07T19:49:29.9403955Z Written: gen_embedding_forward_dense_unweighted_nobag_kernel.cu 2025-05-07T19:49:29.9404395Z Written: gen_embedding_forward_dense_unweighted_vbe_kernel.cu 2025-05-07T19:49:29.9404810Z Written: gen_embedding_forward_dense_unweighted_kernel.cu 2025-05-07T19:49:29.9405215Z Written: gen_embedding_forward_ssd_weighted_vbe_kernel.cu 2025-05-07T19:49:29.9405618Z Written: gen_embedding_forward_split_weighted_vbe_kernel.cu 2025-05-07T19:49:29.9406007Z Written: gen_embedding_forward_ssd_weighted_kernel.cu 2025-05-07T19:49:29.9406391Z Written: gen_embedding_forward_split_weighted_kernel.cu 2025-05-07T19:49:29.9406793Z Written: gen_embedding_forward_ssd_unweighted_nobag_kernel.cu 2025-05-07T19:49:29.9413154Z Written: gen_embedding_forward_split_unweighted_nobag_kernel.cu 2025-05-07T19:49:29.9413608Z Written: gen_embedding_forward_ssd_unweighted_vbe_kernel.cu 2025-05-07T19:49:29.9414031Z Written: gen_embedding_forward_split_unweighted_vbe_kernel.cu 2025-05-07T19:49:29.9414448Z Written: gen_embedding_forward_ssd_unweighted_kernel.cu 2025-05-07T19:49:29.9414844Z Written: gen_embedding_forward_split_unweighted_kernel.cu 2025-05-07T19:49:29.9415279Z Written: gen_embedding_forward_split_weighted_vbe_gwd_kernel.cu 2025-05-07T19:49:29.9415716Z Written: gen_embedding_forward_split_weighted_gwd_kernel.cu 2025-05-07T19:49:29.9416163Z Written: gen_embedding_forward_split_unweighted_vbe_gwd_kernel.cu 2025-05-07T19:49:29.9416615Z Written: gen_embedding_forward_split_unweighted_gwd_kernel.cu 2025-05-07T19:49:29.9417031Z Written: gen_embedding_forward_split_weighted_v2_kernel.cu 2025-05-07T19:49:29.9417448Z Written: gen_embedding_forward_split_unweighted_v2_kernel.cu 2025-05-07T19:49:29.9417897Z Written: gen_embedding_forward_dense_unweighted_nobag_kernel_small.cu 2025-05-07T19:49:29.9418390Z Written: gen_embedding_forward_dense_unweighted_nobag_kernel_small.cu 2025-05-07T19:49:29.9418861Z Written: gen_embedding_forward_ssd_unweighted_nobag_kernel_small.cu 2025-05-07T19:49:29.9419343Z Written: gen_embedding_forward_split_unweighted_nobag_kernel_small.cu 2025-05-07T19:49:29.9419793Z Written: gen_embedding_forward_split_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.9420196Z Written: gen_embedding_forward_split_pt2_cpu_wrapper.cpp 2025-05-07T19:49:29.9420589Z Written: gen_embedding_forward_ssd_pt2_cuda_wrapper.cpp 2025-05-07T19:49:29.9476379Z 2025-05-07T19:49:29.9476679Z 2025-05-07T19:49:29.9476901Z ================================================================================ 2025-05-07T19:49:29.9477336Z Running code generation script ... 2025-05-07T19:49:29.9478243Z /__w/_temp/conda_environment_14891846312/bin/python /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_index_select.py --opensource 2025-05-07T19:49:29.9478969Z ================================================================================ 2025-05-07T19:49:29.9479314Z 2025-05-07T19:49:30.2637483Z [ARGS PARSE] Parsed arguments: Namespace(install_dir='.', is_fbcode=False, is_rocm=False) 2025-05-07T19:49:30.2638601Z [INDEX SELECT GENERATOR]: ['/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/codegen/genscript/generate_index_select.py', '--opensource'] 2025-05-07T19:49:30.2639320Z Written: gen_batch_index_select_dim0_forward_codegen_cuda.cu 2025-05-07T19:49:30.2639897Z Written: gen_batch_index_select_dim0_forward_kernel.cu 2025-05-07T19:49:30.2640303Z Written: gen_batch_index_select_dim0_forward_kernel_small.cu 2025-05-07T19:49:30.2640717Z Written: gen_batch_index_select_dim0_backward_codegen_cuda.cu 2025-05-07T19:49:30.2641141Z Written: gen_batch_index_select_dim0_backward_kernel_cta.cu 2025-05-07T19:49:30.2641559Z Written: gen_batch_index_select_dim0_backward_kernel_warp.cu 2025-05-07T19:49:30.2642022Z Written: gen_embedding_backward_split_batch_index_select_device_kernel.cuh 2025-05-07T19:49:30.2642500Z Written: gen_embedding_backward_split_grad_index_select.cu 2025-05-07T19:49:30.2642918Z Written: gen_embedding_backward_split_common_device_kernel.cuh 2025-05-07T19:49:30.2847186Z 2025-05-07T19:49:30.2847346Z 2025-05-07T19:49:30.2847564Z ================================================================================ 2025-05-07T19:49:30.2848124Z GPU CPP Library Target: fbgemm_gpu_experimental_gen_ai (SHARED) 2025-05-07T19:49:30.2848579Z 2025-05-07T19:49:30.2848821Z CPU_SRCS: 2025-05-07T19:49:30.2849396Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/attention/attention.cpp 2025-05-07T19:49:30.2850323Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/coalesce/coalesce.cpp 2025-05-07T19:49:30.2851234Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cpp 2025-05-07T19:49:30.2852030Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/comm/car.cpp 2025-05-07T19:49:30.2852984Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cpp 2025-05-07T19:49:30.2853975Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/moe/index_shuffling.cpp 2025-05-07T19:49:30.2854880Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cpp 2025-05-07T19:49:30.2855500Z 2025-05-07T19:49:30.2855721Z GPU_SRCS: 2025-05-07T19:49:30.2856291Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/attention/gqa_attn_splitk.cu 2025-05-07T19:49:30.2857242Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/coalesce/coalesce.cu 2025-05-07T19:49:30.2858139Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu 2025-05-07T19:49:30.2859009Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/comm/car.cu 2025-05-07T19:49:30.2859922Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu 2025-05-07T19:49:30.2860890Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/moe/index_shuffling.cu 2025-05-07T19:49:30.2861795Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu 2025-05-07T19:49:30.2862395Z 2025-05-07T19:49:30.2862633Z CUDA_SPECIFIC_SRCS: 2025-05-07T19:49:30.2863316Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu 2025-05-07T19:49:30.2864466Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16.cu 2025-05-07T19:49:30.2865681Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu 2025-05-07T19:49:30.2866988Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu 2025-05-07T19:49:30.2868215Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16.cu 2025-05-07T19:49:30.2869696Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:49:30.2871047Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:49:30.2872557Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:49:30.2874058Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T19:49:30.2875440Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T19:49:30.2876468Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T19:49:30.2877492Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T19:49:30.2878808Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T19:49:30.2879808Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T19:49:30.2880823Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T19:49:30.2881829Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T19:49:30.2882880Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T19:49:30.2883883Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T19:49:30.2885144Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T19:49:30.2886147Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T19:49:30.2887155Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T19:49:30.2888148Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T19:49:30.2889201Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T19:49:30.2890203Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T19:49:30.2891204Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T19:49:30.2892200Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T19:49:30.2893202Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T19:49:30.2894305Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T19:49:30.2895663Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T19:49:30.2896897Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16.cu 2025-05-07T19:49:30.2898281Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu 2025-05-07T19:49:30.2899365Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu 2025-05-07T19:49:30.2900294Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_lite.cu 2025-05-07T19:49:30.2901587Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu 2025-05-07T19:49:30.2902615Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu 2025-05-07T19:49:30.2903784Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu 2025-05-07T19:49:30.2904947Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu 2025-05-07T19:49:30.2906267Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu 2025-05-07T19:49:30.2907425Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu 2025-05-07T19:49:30.2908778Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu 2025-05-07T19:49:30.2910199Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu 2025-05-07T19:49:30.2911363Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu 2025-05-07T19:49:30.2912615Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu 2025-05-07T19:49:30.2914253Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu 2025-05-07T19:49:30.2915662Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu 2025-05-07T19:49:30.2916899Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu 2025-05-07T19:49:30.2918054Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu 2025-05-07T19:49:30.2919202Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu 2025-05-07T19:49:30.2920234Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu 2025-05-07T19:49:30.2921165Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu 2025-05-07T19:49:30.2922061Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu 2025-05-07T19:49:30.2922940Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu 2025-05-07T19:49:30.2923863Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu 2025-05-07T19:49:30.2924746Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/i8i8bf16.cu 2025-05-07T19:49:30.2925699Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu 2025-05-07T19:49:30.2926583Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu 2025-05-07T19:49:30.2927419Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/bf16_fast_gemv.cu 2025-05-07T19:49:30.2928312Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu 2025-05-07T19:49:30.2929147Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu 2025-05-07T19:49:30.2929970Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/fast_gemv.cu 2025-05-07T19:49:30.2930784Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/fast_gemv.cuh 2025-05-07T19:49:30.2931590Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/utility.cuh 2025-05-07T19:49:30.2932121Z 2025-05-07T19:49:30.2932292Z HIP_SPECIFIC_SRCS: 2025-05-07T19:49:30.2932711Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gemm/ck_extensions.hip 2025-05-07T19:49:30.2933349Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gemm/gemm.cpp 2025-05-07T19:49:30.2934128Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/bf16_grouped_gemm.hip 2025-05-07T19:49:30.2935348Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x32x128_16x16_1x1_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2936810Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x32x64_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.2938363Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x32x64_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.2939825Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x64x128_16x16_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_interwave_v1.hip 2025-05-07T19:49:30.2941283Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x64x128_16x16_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_interwave_v2.hip 2025-05-07T19:49:30.2944033Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x64x128_16x16_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_intrawave_v2.hip 2025-05-07T19:49:30.2945492Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x96x128_16x16_1x3_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.2946950Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x96x128_16x16_1x3_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2948412Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x96x128_16x16_1x3_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.2949868Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x96x128_16x16_1x3_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.2951317Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x16x96x64_16x16_1x3_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.2952968Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x32x16x64_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2954426Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x32x64x128_32x32_1x1_16x8x1_16x8x1_1x16x1x8_8x8x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2955964Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x32x64x128_32x32_1x1_16x8x1_16x8x1_1x16x1x8_8x8x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.2957423Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x32x96x128_16x16_2x3_16x8x1_16x8x1_1x32x1x4_8x8x1_2x1_intrawave_v2.hip 2025-05-07T19:49:30.2958876Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x64x128x64_32x32_2x2_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2960328Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_128x64x96x64_16x16_4x3_8x16x1_8x16x1_1x32x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2961802Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x128x128x128_32x32_2x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2963284Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x128x128x64_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.2964743Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x128x128x64_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2966212Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x128x224x64_16x16_4x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2967685Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x128x256x64_32x32_4x2_8x32x1_8x32x1_1x16x1x16_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2969149Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x128x96x64_16x16_4x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2970623Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x16x128x128_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.2972104Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x16x128x128_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8x8x1_1x2_intrawave_v2.hip 2025-05-07T19:49:30.2973580Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x16x64x128_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2975053Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x224x256x32_16x16_7x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.2976523Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x256x128x32_32x32_4x2_4x64x1_4x64x1_1x32x1x8_8x8x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.2978035Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x256x160x64_16x16_8x5_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2979613Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x256x192x64_32x32_4x3_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2981075Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x256x224x64_16x16_8x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2982608Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x256x256x64_32x32_4x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2984079Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x32x128x128_16x16_1x4_16x16x1_16x16x1_1x32x1x8_8x8x1_1x2_intrawave_v2.hip 2025-05-07T19:49:30.2985774Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x32x224x64_16x16_1x7_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.2987234Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x32x96x64_16x16_1x3_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.2988693Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x32x96x64_16x16_1x3_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2990164Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x64x128x128_32x32_2x1_16x16x1_16x16x1_1x16x1x16_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.2991646Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x64x192x128_16x16_4x3_16x16x1_16x16x1_1x32x1x8_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2993166Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_256x64x96x64_16x16_2x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.2994617Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_64x16x16x128_16x16_1x1_16x4x1_16x4x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2996078Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_64x16x16x128_16x16_1x1_16x4x1_16x4x1_1x16x1x4_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.2997517Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_64x16x16x64_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.2998963Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_64x16x32x128_16x16_1x2_16x4x1_16x4x1_1x16x1x4_8x8x1_1x2_intrawave_v2.hip 2025-05-07T19:49:30.3000418Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_64x16x48x128_16x16_1x3_16x4x1_16x4x1_1x16x1x4_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3001868Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/bf16_grouped/kernels/bf16_grouped_64x16x64x128_16x16_1x4_16x4x1_16x4x1_1x16x1x4_8x8x1_1x2_intrawave_v2.hip 2025-05-07T19:49:30.3002996Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/ck_utility.hip 2025-05-07T19:49:30.3003826Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_blockwise_gemm.hip 2025-05-07T19:49:30.3004727Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/fp8_rowwise_gemm.hip 2025-05-07T19:49:30.3006073Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x128x16x128_16x16_4x1_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3007514Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x128x32x128_32x32_2x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3009707Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3011191Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2_4_split_k.hip 2025-05-07T19:49:30.3012696Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2_8_split_k.hip 2025-05-07T19:49:30.3014176Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3015621Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3017090Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2_8_split_k.hip 2025-05-07T19:49:30.3018565Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x256_16x16_1x1_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3020009Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3021475Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_interwave_v2_2_split_k.hip 2025-05-07T19:49:30.3022953Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3024425Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_intrawave_v2_2_split_k.hip 2025-05-07T19:49:30.3025899Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x16x32x512_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3027344Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x128x128_32x32_1x2_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3028797Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3030235Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3031681Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x16x256_16x16_1x1_16x8x1_16x8x1_1x32x1x4_4x4x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3033291Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x16x512_16x16_1x1_32x4x1_32x4x1_1x32x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3034823Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x16x512_16x16_1x1_32x4x1_32x4x1_1x32x1x4_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3036266Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x64x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3037759Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x32x64x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3039193Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x64x32x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3040646Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_128x64x32x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3042096Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x128x128_16x16_4x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3043548Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3045001Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3046459Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v5.hip 2025-05-07T19:49:30.3047919Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x128x256_32x32_2x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3049375Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x128x64_32x32_2x2_4x64x1_4x64x1_1x32x1x8_8x8x1_1x1_intrawave_v4.hip 2025-05-07T19:49:30.3050825Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x160x128_16x16_4x5_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3052280Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x160x128_32x32_1x5_8x32x1_8x32x1_1x64x1x4_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3053736Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x192x128_32x32_2x3_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3055194Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x256x128_32x32_2x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3056642Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x64x128_32x32_2x1_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3058095Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x64x256_32x32_2x1_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3059639Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x96x128_16x16_4x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3061186Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x128x96x256_32x32_1x3_16x16x1_16x16x1_1x64x1x4_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3062644Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x160x128x128_16x16_5x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3064098Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x160x256x128_16x16_5x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3065546Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x160x96x128_16x16_5x3_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3067072Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x16x64x128_16x16_1x1_16x16x1_8x32x1_1x16x1x16_4x4x1_1x1_intrawave_v2_8_split_k.hip 2025-05-07T19:49:30.3068566Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x16x64x512_16x16_1x1_32x8x1_32x8x1_1x16x1x16_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3070001Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x16x64x512_16x16_1x1_32x8x1_32x8x1_1x16x1x16_4x4x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3071451Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x192x128x128_16x16_6x4_8x32x1_8x32x1_1x32x1x8_8x8x1_2x2_intrawave_v3.hip 2025-05-07T19:49:30.3072972Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x192x192x128_16x16_6x6_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3074426Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x192x224x128_16x16_6x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3075881Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x192x256x128_16x16_6x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3077381Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x192x256x128_16x16_6x8_8x32x1_8x32x1_1x32x1x8_8x8x1_2x2_intrawave_v3.hip 2025-05-07T19:49:30.3078833Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x224x160x128_16x16_7x5_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3080286Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x224x192x128_16x16_7x6_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3081743Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x224x256x128_16x16_7x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3083190Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x128x128_16x16_8x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3084641Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x128x128_32x32_4x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3086511Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x160x128_16x16_8x5_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3088106Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x192x128_16x16_8x6_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3089566Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x192x128_32x32_4x3_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3091018Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x224x128_16x16_8x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3092478Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x256x128_16x16_8x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3093920Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x256x64_16x16_8x8_4x64x1_4x64x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3095610Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x256x64_32x32_4x4_4x64x1_4x64x1_1x32x1x8_8x8x1_1x1_intrawave_v4.hip 2025-05-07T19:49:30.3097272Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x96x128_16x16_8x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3098714Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x256x96x128_32x32_2x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3100169Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x32x128x256_32x32_1x1_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3101622Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x32x64x512_16x16_1x2_32x8x1_32x8x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3103055Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x128x128_32x32_1x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3104503Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x128x256_32x32_1x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3105956Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x16x512_16x16_1x1_32x8x1_32x8x1_1x64x1x4_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3107396Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x192x128_32x32_1x3_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3108849Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x192x256_32x32_1x3_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3110295Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x256x128_32x32_1x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3111731Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x64x128_32x32_1x1_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3113380Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x64x512_32x32_1x1_32x8x1_32x8x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3114905Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x64x96x256_16x16_2x3_16x16x1_16x16x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3116365Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x80x128x256_16x16_5x2_16x16x1_16x16x1_1x16x1x16_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3117815Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_256x96x128x128_16x16_3x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3119253Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x128_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4x4x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3120667Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x128_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3122096Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x16x1x4_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3123530Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x4x1x16_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3124958Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x512_16x16_1x1_32x2x1_32x2x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3126389Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x512_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3127865Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise/kernels/fp8_rowwise_64x16x16x64_16x16_1x1_4x16x1_4x16x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3129103Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/fp8_rowwise_batched_gemm.hip 2025-05-07T19:49:30.3130415Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x256_16x16_1x1_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3131978Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x256_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3133527Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x256_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3135086Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x256_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3136637Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x256_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3138183Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3139823Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x512_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3141456Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x16x32x512_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3143010Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x32x128x128_32x32_1x2_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3144566Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_128x32x64x128_32x32_1x1_8x16x1_8x16x1_1x16x1x8_8x8x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3146129Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3147746Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v4.hip 2025-05-07T19:49:30.3149314Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v5.hip 2025-05-07T19:49:30.3150885Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x128x256_32x32_2x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3152511Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x160x128_32x32_1x5_8x32x1_8x32x1_1x64x1x4_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3154078Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x192x128_32x32_2x3_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3155645Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x256x128_32x32_2x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3157209Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x64x128_32x32_2x1_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3158827Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x128x96x256_32x32_1x3_16x16x1_16x16x1_1x64x1x4_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3160390Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x16x64x512_16x16_1x1_32x8x1_32x8x1_1x16x1x16_4x4x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3161961Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x224x256x128_16x16_7x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3163529Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x256x128x128_16x16_8x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3165185Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x256x160x128_16x16_8x5_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3166753Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x256x192x128_16x16_8x6_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3168413Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x256x224x128_16x16_8x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3169982Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x256x256x128_16x16_8x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3171550Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x32x128x256_32x32_1x1_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3173114Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x32x64x512_16x16_1x2_32x8x1_32x8x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3174679Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x64x128x256_32x32_1x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3176251Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x64x192x256_32x32_1x3_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3177858Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x64x64x128_32x32_1x1_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3179415Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_256x64x64x512_32x32_1x1_32x8x1_32x8x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3180969Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_64x16x16x512_16x16_1x1_32x2x1_32x2x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3182513Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_64x16x16x512_16x16_1x1_32x2x1_32x2x1_1x16x1x4_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3184043Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_64x16x16x512_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4_1x1_interwave_v1.hip 2025-05-07T19:49:30.3185799Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_batched/kernels/fp8_rowwise_batched_64x16x16x512_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4_1x1_interwave_v2.hip 2025-05-07T19:49:30.3187125Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/fp8_rowwise_grouped_gemm.hip 2025-05-07T19:49:30.3188457Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3190017Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x32x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3191566Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x32x256_16x16_1x1_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3193333Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3194998Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3196551Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x32x512_16x16_1x1_32x4x1_32x4x1_1x16x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3198109Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x64x256_16x16_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_interwave_v1.hip 2025-05-07T19:49:30.3199671Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x64x256_16x16_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_interwave_v2.hip 2025-05-07T19:49:30.3201225Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x64x256_16x16_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.3202779Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x16x96x256_16x16_1x3_16x8x1_16x8x1_1x16x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3204335Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x32x16x128_16x16_1x1_8x16x1_8x16x1_1x16x1x8_2x2x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3205894Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x32x64x256_16x16_1x4_16x8x1_16x8x1_1x32x1x4_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.3207452Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x32x64x256_32x32_1x1_16x8x1_16x8x1_1x16x1x8_8x8x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3209005Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x64x64x256_32x32_1x2_16x8x1_16x8x1_1x16x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3210558Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_128x64x64x256_32x32_2x1_16x8x1_16x8x1_1x16x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3212127Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3213696Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x128x128x128_32x32_2x2_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3215274Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x128x128x256_32x32_2x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3216845Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x128x224x128_16x16_4x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3218504Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x128x256x128_32x32_4x2_8x32x1_8x32x1_1x16x1x16_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3220072Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x128x96x128_16x16_4x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3221712Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x128x256_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.3223292Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x128x256_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8x8x1_1x2_intrawave_v2.hip 2025-05-07T19:49:30.3224870Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x128x256_16x16_1x2_16x16x1_16x16x1_1x16x1x16_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3226446Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x64x256_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3228021Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x64x256_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3229630Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x64x256_16x16_1x1_16x16x1_16x16x1_1x16x1x16_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3231289Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x16x64x512_16x16_1x1_32x8x1_32x8x1_1x16x1x16_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3232938Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x192x96x128_16x16_6x3_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3234514Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x224x256x128_16x16_7x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v3.hip 2025-05-07T19:49:30.3236077Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x256x128x64_32x32_4x2_4x64x1_4x64x1_1x32x1x8_8x8x1_1x1_interwave_v1.hip 2025-05-07T19:49:30.3237644Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x256x160x128_32x32_2x5_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3239225Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x256x192x128_32x32_4x3_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3240798Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x256x224x128_16x16_8x7_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3242367Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x256x256x128_32x32_4x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3243940Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x256x256x128_32x32_8x2_8x32x1_8x32x1_1x16x1x16_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3245605Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x128x128_16x16_1x4_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_interwave_v2.hip 2025-05-07T19:49:30.3247235Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x160x128_16x16_1x5_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3248806Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x160x128_16x16_1x5_8x32x1_8x32x1_1x32x1x8_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3250370Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x256x128_16x16_1x8_8x32x1_8x32x1_1x32x1x8_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.3251931Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x32x512_16x16_1x1_32x8x1_32x8x1_1x32x1x8_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3253488Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x32x512_16x16_1x1_32x8x1_32x8x1_1x32x1x8_4x4x1_1x1_intrawave_v2.hip 2025-05-07T19:49:30.3255047Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x32x64x512_16x16_2x1_32x8x1_32x8x1_1x32x1x8_8x8x1_2x1_intrawave_v2.hip 2025-05-07T19:49:30.3256608Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x64x128x256_32x32_1x2_16x16x1_16x16x1_1x32x1x8_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3258194Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x64x128x256_32x32_2x1_16x16x1_16x16x1_1x16x1x16_8x8x1_1x1_intrawave_v3.hip 2025-05-07T19:49:30.3259766Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x64x160x128_16x16_2x5_8x32x1_8x32x1_1x64x1x4_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3261327Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_256x64x192x128_16x16_4x3_8x32x1_8x32x1_1x32x1x8_8x8x1_2x1_intrawave_v3.hip 2025-05-07T19:49:30.3262878Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_64x16x16x128_16x16_1x1_8x8x1_8x8x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3264424Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x16x1x4_4x4x1_1x1_interwave_v2.hip 2025-05-07T19:49:30.3265974Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_64x16x16x256_16x16_1x1_16x4x1_16x4x1_1x16x1x4_4x4x1_1x1_intrawave_v1.hip 2025-05-07T19:49:30.3267526Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_64x16x32x256_16x16_1x2_16x4x1_16x4x1_1x16x1x4_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.3269074Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_64x16x64x256_16x16_1x4_16x4x1_16x4x1_1x16x1x4_8x8x1_1x2_interwave_v1.hip 2025-05-07T19:49:30.3270616Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped_64x16x64x256_16x16_1x4_16x4x1_16x4x1_1x16x1x4_8x8x1_1x2_intrawave_v1.hip 2025-05-07T19:49:30.3272010Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/ck_extensions/fp8_tensorwise_gemm.hip 2025-05-07T19:49:30.3272582Z 2025-05-07T19:49:30.3272743Z OTHER_SRCS: 2025-05-07T19:49:30.3272856Z 2025-05-07T19:49:30.3272924Z 2025-05-07T19:49:30.3273079Z CC_FLAGS: 2025-05-07T19:49:30.3273185Z 2025-05-07T19:49:30.3273248Z 2025-05-07T19:49:30.3273403Z NVCC_FLAGS: 2025-05-07T19:49:30.3273597Z 2025-05-07T19:49:30.3273662Z 2025-05-07T19:49:30.3273817Z HIPCC_FLAGS: 2025-05-07T19:49:30.3273935Z 2025-05-07T19:49:30.3273999Z 2025-05-07T19:49:30.3274156Z INCLUDE_DIRS: 2025-05-07T19:49:30.3274408Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:30.3274758Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:49:30.3275092Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include 2025-05-07T19:49:30.3275466Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:30.3275927Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include 2025-05-07T19:49:30.3276578Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include 2025-05-07T19:49:30.3277186Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src 2025-05-07T19:49:30.3277687Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include 2025-05-07T19:49:30.3278214Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include 2025-05-07T19:49:30.3278774Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include 2025-05-07T19:49:30.3279373Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include 2025-05-07T19:49:30.3279916Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include 2025-05-07T19:49:30.3280459Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include 2025-05-07T19:49:30.3281041Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize 2025-05-07T19:49:30.3281431Z 2025-05-07T19:49:30.3281610Z Selected Source Files: 2025-05-07T19:49:30.3282038Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/attention/attention.cpp 2025-05-07T19:49:30.3282710Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/coalesce/coalesce.cpp 2025-05-07T19:49:30.3290453Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cpp 2025-05-07T19:49:30.3291262Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/comm/car.cpp 2025-05-07T19:49:30.3292178Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cpp 2025-05-07T19:49:30.3292916Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/moe/index_shuffling.cpp 2025-05-07T19:49:30.3293581Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cpp 2025-05-07T19:49:30.3294274Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/attention/gqa_attn_splitk.cu 2025-05-07T19:49:30.3294962Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/coalesce/coalesce.cu 2025-05-07T19:49:30.3295606Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu 2025-05-07T19:49:30.3296228Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/comm/car.cu 2025-05-07T19:49:30.3296912Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu 2025-05-07T19:49:30.3297669Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/moe/index_shuffling.cu 2025-05-07T19:49:30.3298326Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu 2025-05-07T19:49:30.3299099Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu 2025-05-07T19:49:30.3300151Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16.cu 2025-05-07T19:49:30.3301053Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu 2025-05-07T19:49:30.3302013Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu 2025-05-07T19:49:30.3303014Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16.cu 2025-05-07T19:49:30.3303920Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:49:30.3304916Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:49:30.3305911Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:49:30.3306906Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T19:49:30.3307897Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T19:49:30.3308882Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T19:49:30.3309883Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T19:49:30.3310880Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T19:49:30.3311869Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T19:49:30.3312953Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T19:49:30.3313946Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T19:49:30.3314935Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T19:49:30.3315930Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T19:49:30.3316964Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T19:49:30.3317956Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T19:49:30.3318945Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T19:49:30.3319941Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T19:49:30.3320930Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T19:49:30.3321923Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T19:49:30.3322908Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T19:49:30.3323893Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T19:49:30.3324878Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T19:49:30.3325972Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T19:49:30.3326959Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T19:49:30.3327854Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16.cu 2025-05-07T19:49:30.3328782Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu 2025-05-07T19:49:30.3329660Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu 2025-05-07T19:49:30.3330514Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_lite.cu 2025-05-07T19:49:30.3331374Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu 2025-05-07T19:49:30.3332393Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu 2025-05-07T19:49:30.3333562Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu 2025-05-07T19:49:30.3334729Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu 2025-05-07T19:49:30.3335884Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu 2025-05-07T19:49:30.3337354Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu 2025-05-07T19:49:30.3338544Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu 2025-05-07T19:49:30.3339696Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu 2025-05-07T19:49:30.3340858Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu 2025-05-07T19:49:30.3342112Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu 2025-05-07T19:49:30.3343417Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu 2025-05-07T19:49:30.3344816Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu 2025-05-07T19:49:30.3346045Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu 2025-05-07T19:49:30.3347247Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu 2025-05-07T19:49:30.3348398Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu 2025-05-07T19:49:30.3349426Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu 2025-05-07T19:49:30.3350346Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu 2025-05-07T19:49:30.3351227Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu 2025-05-07T19:49:30.3352320Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu 2025-05-07T19:49:30.3353238Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu 2025-05-07T19:49:30.3354117Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/i8i8bf16.cu 2025-05-07T19:49:30.3355053Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu 2025-05-07T19:49:30.3355929Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu 2025-05-07T19:49:30.3356767Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/bf16_fast_gemv.cu 2025-05-07T19:49:30.3357578Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu 2025-05-07T19:49:30.3358414Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu 2025-05-07T19:49:30.3359236Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/fast_gemv.cu 2025-05-07T19:49:30.3360041Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/fast_gemv.cuh 2025-05-07T19:49:30.3360856Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/utility.cuh 2025-05-07T19:49:30.3361372Z 2025-05-07T19:49:30.3361543Z HIPified Source Files: 2025-05-07T19:49:30.3361687Z 2025-05-07T19:49:30.3361757Z 2025-05-07T19:49:30.3361924Z Library Dependencies: 2025-05-07T19:49:30.3362131Z torch 2025-05-07T19:49:30.3362295Z torch_library 2025-05-07T19:49:30.3362668Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so 2025-05-07T19:49:30.3363116Z /usr/local/cuda-12.8/lib64/libnvrtc.so 2025-05-07T19:49:30.3363576Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so 2025-05-07T19:49:30.3364227Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 2025-05-07T19:49:30.3364680Z /usr/lib64/libcuda.so 2025-05-07T19:49:30.3365007Z /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:30.3365350Z 2025-05-07T19:49:30.3365519Z Output Library: 2025-05-07T19:49:30.3365739Z fbgemm_gpu_experimental_gen_ai 2025-05-07T19:49:30.3365972Z 2025-05-07T19:49:30.3366139Z Destination Directory: 2025-05-07T19:49:30.3366284Z 2025-05-07T19:49:30.3366386Z ================================================================================ 2025-05-07T19:49:30.3366596Z 2025-05-07T19:49:30.3366600Z 2025-05-07T19:49:30.3366604Z 2025-05-07T19:49:30.3366712Z ================================================================================ 2025-05-07T19:49:30.3367047Z Adding to Package: fbgemm_gpu/experimental/gen_ai 2025-05-07T19:49:30.3367338Z 2025-05-07T19:49:30.3367493Z TARGETS: 2025-05-07T19:49:30.3367686Z fbgemm_gpu_experimental_gen_ai 2025-05-07T19:49:30.3367912Z 2025-05-07T19:49:30.3368069Z FILES: 2025-05-07T19:49:30.3368168Z 2025-05-07T19:49:30.3368270Z ================================================================================ 2025-05-07T19:49:30.3368475Z 2025-05-07T19:49:30.3368478Z 2025-05-07T19:49:30.3368482Z 2025-05-07T19:49:30.3368586Z ================================================================================ 2025-05-07T19:49:30.3368971Z GPU CPP Library Target: fbgemm_gpu_experimental_example_py (SHARED) 2025-05-07T19:49:30.3369309Z 2025-05-07T19:49:30.3369470Z CPU_SRCS: 2025-05-07T19:49:30.3369574Z 2025-05-07T19:49:30.3369637Z 2025-05-07T19:49:30.3369793Z GPU_SRCS: 2025-05-07T19:49:30.3370162Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/example_nccl.cpp 2025-05-07T19:49:30.3370788Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/example_ops.cpp 2025-05-07T19:49:30.3371551Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/cutlass_sgemm_nn.cu 2025-05-07T19:49:30.3371989Z 2025-05-07T19:49:30.3372157Z CUDA_SPECIFIC_SRCS: 2025-05-07T19:49:30.3372285Z 2025-05-07T19:49:30.3372347Z 2025-05-07T19:49:30.3372512Z HIP_SPECIFIC_SRCS: 2025-05-07T19:49:30.3372639Z 2025-05-07T19:49:30.3372702Z 2025-05-07T19:49:30.3372861Z OTHER_SRCS: 2025-05-07T19:49:30.3372966Z 2025-05-07T19:49:30.3373112Z 2025-05-07T19:49:30.3373265Z CC_FLAGS: 2025-05-07T19:49:30.3373368Z 2025-05-07T19:49:30.3373435Z 2025-05-07T19:49:30.3373583Z NVCC_FLAGS: 2025-05-07T19:49:30.3373694Z 2025-05-07T19:49:30.3373757Z 2025-05-07T19:49:30.3373910Z HIPCC_FLAGS: 2025-05-07T19:49:30.3374027Z 2025-05-07T19:49:30.3374088Z 2025-05-07T19:49:30.3374241Z INCLUDE_DIRS: 2025-05-07T19:49:30.3374498Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:30.3374846Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T19:49:30.3375184Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include 2025-05-07T19:49:30.3375552Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include 2025-05-07T19:49:30.3376003Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include 2025-05-07T19:49:30.3376644Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include 2025-05-07T19:49:30.3377255Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src 2025-05-07T19:49:30.3377755Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include 2025-05-07T19:49:30.3378271Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include 2025-05-07T19:49:30.3378821Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include 2025-05-07T19:49:30.3379421Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include 2025-05-07T19:49:30.3379959Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include 2025-05-07T19:49:30.3380505Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include 2025-05-07T19:49:30.3380912Z 2025-05-07T19:49:30.3381081Z Selected Source Files: 2025-05-07T19:49:30.3381483Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/example_nccl.cpp 2025-05-07T19:49:30.3382107Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/example_ops.cpp 2025-05-07T19:49:30.3382751Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/cutlass_sgemm_nn.cu 2025-05-07T19:49:30.3383188Z 2025-05-07T19:49:30.3383359Z HIPified Source Files: 2025-05-07T19:49:30.3383501Z 2025-05-07T19:49:30.3383563Z 2025-05-07T19:49:30.3383731Z Library Dependencies: 2025-05-07T19:49:30.3383929Z torch 2025-05-07T19:49:30.3384097Z torch_library 2025-05-07T19:49:30.3384454Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so 2025-05-07T19:49:30.3385160Z /usr/local/cuda-12.8/lib64/libnvrtc.so 2025-05-07T19:49:30.3385620Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so 2025-05-07T19:49:30.3386270Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 2025-05-07T19:49:30.3386728Z /usr/lib64/libcuda.so 2025-05-07T19:49:30.3387047Z /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so 2025-05-07T19:49:30.3387393Z 2025-05-07T19:49:30.3387553Z Output Library: 2025-05-07T19:49:30.3387779Z fbgemm_gpu_experimental_example_py 2025-05-07T19:49:30.3388025Z 2025-05-07T19:49:30.3388196Z Destination Directory: 2025-05-07T19:49:30.3388336Z 2025-05-07T19:49:30.3388437Z ================================================================================ 2025-05-07T19:49:30.3388643Z 2025-05-07T19:49:30.3388647Z 2025-05-07T19:49:30.3388650Z 2025-05-07T19:49:30.3388749Z ================================================================================ 2025-05-07T19:49:30.3389245Z Adding to Package: fbgemm_gpu/experimental/example 2025-05-07T19:49:30.3389531Z 2025-05-07T19:49:30.3389688Z TARGETS: 2025-05-07T19:49:30.3389893Z fbgemm_gpu_experimental_example_py 2025-05-07T19:49:30.3390135Z 2025-05-07T19:49:30.3390289Z FILES: 2025-05-07T19:49:30.3390648Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/example/__init__.py 2025-05-07T19:49:30.3391265Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/example/utils.py 2025-05-07T19:49:30.3391825Z ================================================================================ 2025-05-07T19:49:30.3392108Z 2025-05-07T19:49:30.3392140Z 2025-05-07T19:49:30.3392144Z 2025-05-07T19:49:30.3392243Z ================================================================================ 2025-05-07T19:49:30.3392611Z Adding to Package: fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T19:49:30.3392930Z 2025-05-07T19:49:30.3393087Z TARGETS: 2025-05-07T19:49:30.3393187Z 2025-05-07T19:49:30.3393249Z 2025-05-07T19:49:30.3393403Z FILES: 2025-05-07T19:49:30.3393763Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gemm/triton_gemm/__init__.py 2025-05-07T19:49:30.3394392Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py 2025-05-07T19:49:30.3395035Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py 2025-05-07T19:49:30.3395717Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gemm/triton_gemm/matmul_perf_model.py 2025-05-07T19:49:30.3396378Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gemm/triton_gemm/utils.py 2025-05-07T19:49:30.3396830Z ================================================================================ 2025-05-07T19:49:30.3397037Z 2025-05-07T19:49:30.3397373Z CMake Warning at /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/cmake/modules/GpuCppLibrary.cmake:246 (add_library): 2025-05-07T19:49:30.3398016Z Cannot generate a safe runtime search path for target asmjit because files 2025-05-07T19:49:30.3398524Z in some directories may conflict with libraries in implicit directories: 2025-05-07T19:49:30.3398833Z 2025-05-07T19:49:30.3399282Z runtime library [libnccl.so.2] in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib may be hidden by files in: 2025-05-07T19:49:30.3399905Z /usr/local/cuda-12.8/lib64 2025-05-07T19:49:30.3400078Z 2025-05-07T19:49:30.3400208Z Some of these libraries may not be found correctly. 2025-05-07T19:49:30.3400529Z Call Stack (most recent call first): 2025-05-07T19:49:30.3400814Z cmake/Asmjit.cmake:19 (gpu_cpp_library) 2025-05-07T19:49:30.3401102Z CMakeLists.txt:154 (include) 2025-05-07T19:49:30.3401268Z 2025-05-07T19:49:30.3401271Z 2025-05-07T19:49:30.3401607Z CMake Warning at /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/cmake/modules/GpuCppLibrary.cmake:246 (add_library): 2025-05-07T19:49:30.3402188Z Cannot generate a safe runtime search path for target 2025-05-07T19:49:30.3402618Z fbgemm_gpu_experimental_gen_ai because files in some directories may 2025-05-07T19:49:30.3403047Z conflict with libraries in implicit directories: 2025-05-07T19:49:30.3403270Z 2025-05-07T19:49:30.3403717Z runtime library [libnccl.so.2] in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib may be hidden by files in: 2025-05-07T19:49:30.3404338Z /usr/local/cuda-12.8/lib64 2025-05-07T19:49:30.3404510Z 2025-05-07T19:49:30.3404638Z Some of these libraries may not be found correctly. 2025-05-07T19:49:30.3404958Z Call Stack (most recent call first): 2025-05-07T19:49:30.3405295Z experimental/gen_ai/CMakeLists.txt:81 (gpu_cpp_library) 2025-05-07T19:49:30.3405543Z 2025-05-07T19:49:30.3405547Z 2025-05-07T19:49:30.3405887Z CMake Warning at /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/cmake/modules/GpuCppLibrary.cmake:246 (add_library): 2025-05-07T19:49:30.3406457Z Cannot generate a safe runtime search path for target 2025-05-07T19:49:30.3406899Z fbgemm_gpu_experimental_example_py because files in some directories may 2025-05-07T19:49:30.3407429Z conflict with libraries in implicit directories: 2025-05-07T19:49:30.3407656Z 2025-05-07T19:49:30.3408095Z runtime library [libnccl.so.2] in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib may be hidden by files in: 2025-05-07T19:49:30.3408714Z /usr/local/cuda-12.8/lib64 2025-05-07T19:49:30.3408880Z 2025-05-07T19:49:30.3409008Z Some of these libraries may not be found correctly. 2025-05-07T19:49:30.3409325Z Call Stack (most recent call first): 2025-05-07T19:49:30.3409737Z experimental/example/CMakeLists.txt:31 (gpu_cpp_library) 2025-05-07T19:49:30.3409999Z 2025-05-07T19:49:30.3410003Z 2025-05-07T19:49:30.3410091Z -- Configuring done (8.9s) 2025-05-07T19:49:30.3440904Z -- Generating done (0.1s) 2025-05-07T19:49:30.3455983Z -- Build files have been written to: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build 2025-05-07T19:49:30.3739103Z Change Dir: '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build' 2025-05-07T19:49:30.3739710Z 2025-05-07T19:49:30.3740033Z Run Build Command(s): /__w/_temp/conda_environment_14891846312/bin/ninja -v -j 8 install 2025-05-07T19:49:30.6735342Z [1/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64func.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64func.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64func.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64func.cpp 2025-05-07T19:49:30.6743201Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64func.cpp:10: 2025-05-07T19:49:30.6746131Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.6749544Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6752823Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.6754434Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6756172Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.6759648Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6762800Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.6764823Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6766543Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.6770362Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6773632Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.6775181Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6776886Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.6780396Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6783583Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.6785477Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6787234Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.6790680Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6793892Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.6795386Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6797145Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.6800574Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6803480Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.6805112Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6807104Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.6810722Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6813952Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.6815506Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6817008Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.6820278Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6823230Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.6824723Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6826394Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.6829748Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6832946Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.6834502Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6836166Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.6839451Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.6842532Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.6844102Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.6844600Z At global scope: 2025-05-07T19:49:30.6845797Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.7038443Z [2/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instdb.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instdb.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instdb.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instdb.cpp 2025-05-07T19:49:30.7046661Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64instdb_p.h:12, 2025-05-07T19:49:30.7047899Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instdb.cpp:11: 2025-05-07T19:49:30.7050096Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.7053628Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7056834Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.7058511Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7060186Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.7063498Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7066624Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.7068240Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7069892Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.7073343Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7076400Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.7078318Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7080000Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.7083549Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7086821Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.7088267Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7089971Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.7093281Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7096350Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.7097898Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7099561Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.7102858Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7105922Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.7107475Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7109147Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.7112551Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7115632Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.7117184Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7119128Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.7122634Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7125741Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.7127291Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7128992Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.7132599Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7135692Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.7137302Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7139021Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.7142329Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7145412Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.7147016Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7147503Z At global scope: 2025-05-07T19:49:30.7148708Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.7274613Z [3/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instapi.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instapi.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instapi.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instapi.cpp 2025-05-07T19:49:30.7283336Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64instdb_p.h:12, 2025-05-07T19:49:30.7285167Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instapi.cpp:13: 2025-05-07T19:49:30.7287476Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.7290997Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7294317Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.7295955Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7297770Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.7301119Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7303842Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.7305240Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7306840Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.7309989Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7313069Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.7314704Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7316491Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.7320017Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7323149Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.7324969Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7326776Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.7330456Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7333559Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.7335147Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7336853Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.7340210Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7343305Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.7344886Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7346591Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.7349907Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7353115Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.7354689Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7356351Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.7359726Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7362816Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.7364359Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7366247Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.7369778Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7372872Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.7374453Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7376150Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.7379498Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7382589Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.7384169Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7384681Z At global scope: 2025-05-07T19:49:30.7386285Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.7394553Z [4/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64formatter.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64formatter.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64formatter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64formatter.cpp 2025-05-07T19:49:30.7402580Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64instdb_p.h:12, 2025-05-07T19:49:30.7403756Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64formatter.cpp:13: 2025-05-07T19:49:30.7405719Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.7409067Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7413485Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.7415271Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7416966Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.7432756Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7435996Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.7437582Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7439316Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.7442643Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7445766Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.7447339Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7449015Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.7452349Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7455348Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.7456785Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7458415Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.7461583Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7464871Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.7466456Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7468318Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.7471684Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7474884Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.7476460Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7478185Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.7481511Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7484613Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.7486617Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7488308Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.7491623Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7494720Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.7496534Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7498208Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.7501560Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7504675Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.7506264Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7507965Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.7511503Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.7514780Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.7516389Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.7516904Z At global scope: 2025-05-07T19:49:30.7518191Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.8072639Z [5/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64builder.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64builder.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64builder.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64builder.cpp 2025-05-07T19:49:30.8080826Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64emitter.h:12, 2025-05-07T19:49:30.8082132Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64assembler.h:10, 2025-05-07T19:49:30.8083262Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64builder.cpp:9: 2025-05-07T19:49:30.8085960Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.8089992Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8093220Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.8094836Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8096648Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.8100196Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8103713Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.8105334Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8107120Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.8110636Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8114026Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.8115624Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8117411Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.8120964Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8124039Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.8125528Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8127332Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.8131021Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8134272Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.8135876Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8137636Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.8141149Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8144625Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.8146209Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8148070Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.8151476Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8154736Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.8156336Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8158117Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.8161615Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8164877Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.8166458Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8168272Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.8171801Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8175220Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.8176866Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8178715Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.8182187Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8185776Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.8187623Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8188140Z At global scope: 2025-05-07T19:49:30.8189412Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.8801140Z [6/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64compiler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64compiler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64compiler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64compiler.cpp 2025-05-07T19:49:30.8806196Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64emitter.h:12, 2025-05-07T19:49:30.8807017Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64assembler.h:10, 2025-05-07T19:49:30.8807725Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64compiler.cpp:9: 2025-05-07T19:49:30.8809212Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.8811500Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8813445Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.8814715Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8815798Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.8818491Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8820436Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.8821424Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8822502Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.8824792Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8826733Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.8827713Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8828769Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.8830867Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8832854Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.8833765Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8834821Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.8836929Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8838864Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.8839837Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8841005Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.8843107Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8845036Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.8845999Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8847114Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.8849298Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8851227Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.8852200Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8853276Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.8855373Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8857301Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.8858269Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8859335Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.8861447Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8863392Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.8864378Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8865566Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.8867722Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.8869662Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.8870648Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.8870975Z At global scope: 2025-05-07T19:49:30.8871758Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.9007273Z [7/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64operand.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64operand.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64operand.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64operand.cpp 2025-05-07T19:49:30.9014036Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64operand.cpp:10: 2025-05-07T19:49:30.9016073Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.9018990Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9021679Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.9023032Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9024574Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.9027912Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9030568Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.9032365Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9033928Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.9036789Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9039579Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.9041063Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9042520Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.9045570Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9048151Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.9049593Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9051253Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:30.9054597Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9057583Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.9058927Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9060417Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:30.9063304Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9066084Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.9067472Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9069116Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:30.9072203Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9075419Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.9076947Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9078400Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:30.9081702Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9084374Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:30.9086105Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9087634Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:30.9090509Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9093232Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:30.9094601Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9096042Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:30.9099049Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9101725Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:30.9103112Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9103609Z At global scope: 2025-05-07T19:49:30.9104900Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:30.9944674Z [8/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64emithelper.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64emithelper.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64emithelper.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64emithelper.cpp 2025-05-07T19:49:30.9953179Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64emitter.h:12, 2025-05-07T19:49:30.9954530Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64emithelper_p.h:13, 2025-05-07T19:49:30.9955695Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64emithelper.cpp:14: 2025-05-07T19:49:30.9957998Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:30.9961451Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9964655Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:30.9966336Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9968154Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:30.9971682Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9974957Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:30.9976613Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9978459Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:30.9981940Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9987499Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:30.9989023Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:30.9991132Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:30.9994762Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:30.9997880Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:30.9999355Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0001185Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:31.0004745Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.0008020Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.0009623Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0011417Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:31.0014951Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.0018140Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.0019711Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0021524Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:31.0025018Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.0028244Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.0030204Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0032294Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:31.0035865Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.0039039Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:31.0040655Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0042451Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:31.0046143Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.0049489Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:31.0051070Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0052779Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:31.0056195Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.0059326Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:31.0060946Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.0061442Z At global scope: 2025-05-07T19:49:31.0062683Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:31.1585176Z [9/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/armformatter.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/armformatter.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/armformatter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/armformatter.cpp 2025-05-07T19:49:31.1593756Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/armformatter.cpp:12: 2025-05-07T19:49:31.1596041Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:31.1599445Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1602529Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.1604139Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1605867Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:31.1609165Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1612299Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.1613876Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1615554Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:31.1618959Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1622050Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.1623634Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1625333Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:31.1628698Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1632090Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:31.1633527Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1635577Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:31.1638934Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1642107Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.1643735Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1645476Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:31.1648824Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1651969Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.1653533Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1655222Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:31.1658621Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1661736Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.1663358Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1665034Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:31.1668179Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1671048Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:31.1672877Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1674412Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:31.1677686Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1680708Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:31.1682251Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1683870Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:31.1687586Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1690663Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:31.1692246Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1692768Z At global scope: 2025-05-07T19:49:31.1694011Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:31.1956040Z [10/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/archtraits.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/archtraits.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/archtraits.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/archtraits.cpp 2025-05-07T19:49:31.1964075Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/a64archtraits_p.h:13, 2025-05-07T19:49:31.1965359Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/archtraits.cpp:16: 2025-05-07T19:49:31.1967564Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:31.1971087Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1974952Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.1976564Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1978346Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:31.1981841Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1985445Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.1987078Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1988852Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:31.1992488Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.1995660Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.1997311Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.1999122Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:31.2002561Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2005637Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:31.2007112Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2008836Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:31.2013836Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2017371Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.2019000Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2020959Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:31.2024276Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2027405Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.2028979Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2030695Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:31.2034266Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2037419Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.2038990Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2040750Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:31.2044069Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2047108Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:31.2048780Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2050521Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:31.2053827Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2057009Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:31.2058927Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2060709Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:31.2064435Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.2067777Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:31.2069421Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.2069950Z At global scope: 2025-05-07T19:49:31.2071260Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:31.2079629Z [11/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/assembler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/assembler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/assembler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/assembler.cpp 2025-05-07T19:49:31.3388562Z [12/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codewriter.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codewriter.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codewriter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codewriter.cpp 2025-05-07T19:49:31.4580083Z [13/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/constpool.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/constpool.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/constpool.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/constpool.cpp 2025-05-07T19:49:31.4637712Z [14/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/builder.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/builder.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/builder.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/builder.cpp 2025-05-07T19:49:31.6963132Z [15/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64rapass.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64rapass.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64rapass.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64rapass.cpp 2025-05-07T19:49:31.6968142Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64emitter.h:12, 2025-05-07T19:49:31.6968944Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64assembler.h:10, 2025-05-07T19:49:31.6969679Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64rapass.cpp:12: 2025-05-07T19:49:31.6971164Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:31.6973291Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.6975549Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.6976682Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.6977815Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:31.6979928Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.6981895Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.6982877Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.6983945Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:31.6986528Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.6988554Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.6989526Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.6990601Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:31.6992785Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.6994642Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:31.6995538Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7005864Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:31.7008168Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.7010129Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:31.7011460Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7012675Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:31.7014795Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.7016737Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:31.7017740Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7018806Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:31.7020912Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.7022851Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:31.7023833Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7024901Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:31.7027008Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.7028938Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:31.7029929Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7031002Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:31.7033290Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.7035239Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:31.7036346Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7037421Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:31.7039611Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:31.7041952Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:31.7043221Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:31.7043538Z At global scope: 2025-05-07T19:49:31.7044364Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:31.7049608Z [16/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/compiler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/compiler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/compiler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/compiler.cpp 2025-05-07T19:49:31.7286523Z [17/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emithelper.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emithelper.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emithelper.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emithelper.cpp 2025-05-07T19:49:31.7982665Z [18/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitterutils.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitterutils.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitterutils.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitterutils.cpp 2025-05-07T19:49:31.8051670Z [19/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitter.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitter.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitter.cpp 2025-05-07T19:49:31.8065784Z [20/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/cpuinfo.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/cpuinfo.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/cpuinfo.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/cpuinfo.cpp 2025-05-07T19:49:31.8195869Z [21/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/errorhandler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/errorhandler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/errorhandler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/errorhandler.cpp 2025-05-07T19:49:31.8866377Z [22/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codeholder.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codeholder.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codeholder.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codeholder.cpp 2025-05-07T19:49:31.8881357Z [23/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/environment.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/environment.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/environment.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/environment.cpp 2025-05-07T19:49:31.9418477Z [24/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/globals.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/globals.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/globals.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/globals.cpp 2025-05-07T19:49:32.0315262Z [25/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/inst.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/inst.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/inst.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/inst.cpp 2025-05-07T19:49:32.1133008Z [26/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/func.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/func.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/func.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/func.cpp 2025-05-07T19:49:32.1359128Z [27/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/instdb.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/instdb.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/instdb.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/instdb.cpp 2025-05-07T19:49:32.1541629Z [28/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/funcargscontext.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/funcargscontext.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/funcargscontext.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/funcargscontext.cpp 2025-05-07T19:49:32.1683050Z [29/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/formatter.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/formatter.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/formatter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/formatter.cpp 2025-05-07T19:49:32.2055265Z [30/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitruntime.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitruntime.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitruntime.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitruntime.cpp 2025-05-07T19:49:32.2489442Z [31/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64assembler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64assembler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64assembler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64assembler.cpp 2025-05-07T19:49:32.2497780Z In file included from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/a64emitter.h:12, 2025-05-07T19:49:32.2499047Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/a64assembler.h:10, 2025-05-07T19:49:32.2500203Z from /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64assembler.cpp:18: 2025-05-07T19:49:32.2502520Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB8() const’: 2025-05-07T19:49:32.2506403Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:132:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2509964Z 132 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:32.2511546Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2513464Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH4() const’: 2025-05-07T19:49:32.2517037Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:133:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2520193Z 133 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:32.2521791Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2523591Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS2() const’: 2025-05-07T19:49:32.2527050Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:134:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2530295Z 134 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:32.2531969Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2533819Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD1() const’: 2025-05-07T19:49:32.2537335Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:135:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2540366Z 135 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD1() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature); } 2025-05-07T19:49:32.2541797Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2543463Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB16() const’: 2025-05-07T19:49:32.2547194Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:137:112: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2550482Z 137 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB16() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB); } 2025-05-07T19:49:32.2552248Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2554088Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH8() const’: 2025-05-07T19:49:32.2557704Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:138:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2561017Z 138 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH8() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH); } 2025-05-07T19:49:32.2562633Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2564422Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecS4() const’: 2025-05-07T19:49:32.2567869Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:139:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2570957Z 139 | ASMJIT_INLINE_NODEBUG constexpr bool isVecS4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementS); } 2025-05-07T19:49:32.2572593Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2574401Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecD2() const’: 2025-05-07T19:49:32.2577937Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:140:111: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2581153Z 140 | ASMJIT_INLINE_NODEBUG constexpr bool isVecD2() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementD); } 2025-05-07T19:49:32.2582765Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2584563Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecB4x4() const’: 2025-05-07T19:49:32.2588390Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:141:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2592102Z 141 | ASMJIT_INLINE_NODEBUG constexpr bool isVecB4x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementB4); } 2025-05-07T19:49:32.2593997Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2595815Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h: In member function ‘constexpr bool asmjit::_abi_1_13::a64::Vec::isVecH2x4() const’: 2025-05-07T19:49:32.2599282Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/../arm/../arm/../arm/a64operand.h:142:113: warning: bitwise operation between different enumeration types ‘asmjit::_abi_1_13::BaseReg::’ and ‘asmjit::_abi_1_13::arm::BaseVec::AdditionalBits’ is deprecated [-Wdeprecated-enum-enum-conversion] 2025-05-07T19:49:32.2602428Z 142 | ASMJIT_INLINE_NODEBUG constexpr bool isVecH2x4() const noexcept { return _signature.subset(kBaseSignatureMask | kSignatureRegElementTypeMask) == (RegTraits::kSignature | kSignatureElementH2); } 2025-05-07T19:49:32.2604073Z | ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 2025-05-07T19:49:32.2604581Z At global scope: 2025-05-07T19:49:32.2605867Z cc1plus: note: unrecognized command-line option ‘-Wno-deprecated-anon-enum-enum-conversion’ may have been intended to silence earlier diagnostics 2025-05-07T19:49:32.2630955Z [32/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/logger.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/logger.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/logger.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/logger.cpp 2025-05-07T19:49:32.2713075Z [33/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/operand.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/operand.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/operand.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/operand.cpp 2025-05-07T19:49:32.2889277Z [34/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/osutils.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/osutils.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/osutils.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/osutils.cpp 2025-05-07T19:49:32.3845698Z [35/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/support.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/support.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/support.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/support.cpp 2025-05-07T19:49:32.4028625Z [36/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitallocator.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitallocator.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitallocator.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitallocator.cpp 2025-05-07T19:49:32.4754392Z [37/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/string.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/string.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/string.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/string.cpp 2025-05-07T19:49:32.4768657Z [38/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/target.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/target.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/target.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/target.cpp 2025-05-07T19:49:32.4982364Z [39/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/type.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/type.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/type.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/type.cpp 2025-05-07T19:49:32.5088556Z [40/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rastack.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rastack.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rastack.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rastack.cpp 2025-05-07T19:49:32.5917412Z [41/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zone.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zone.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zone.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zone.cpp 2025-05-07T19:49:32.6068748Z [42/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonelist.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonelist.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonelist.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonelist.cpp 2025-05-07T19:49:32.6402134Z [43/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonetree.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonetree.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonetree.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonetree.cpp 2025-05-07T19:49:32.6411286Z [44/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonehash.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonehash.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonehash.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonehash.cpp 2025-05-07T19:49:32.6463515Z [45/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonestack.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonestack.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonestack.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonestack.cpp 2025-05-07T19:49:32.7891724Z [46/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonevector.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonevector.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonevector.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonevector.cpp 2025-05-07T19:49:32.9268183Z [47/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/ralocal.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/ralocal.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/ralocal.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/ralocal.cpp 2025-05-07T19:49:32.9349453Z [48/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/virtmem.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/virtmem.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/virtmem.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/virtmem.cpp 2025-05-07T19:49:33.3011192Z [49/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86formatter.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86formatter.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86formatter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86formatter.cpp 2025-05-07T19:49:33.4063864Z [50/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rapass.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rapass.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rapass.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rapass.cpp 2025-05-07T19:49:33.6700154Z [51/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86operand.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86operand.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86operand.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86operand.cpp 2025-05-07T19:49:33.6845204Z [52/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instapi.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instapi.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instapi.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instapi.cpp 2025-05-07T19:49:33.9089145Z [53/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instdb.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instdb.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instdb.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instdb.cpp 2025-05-07T19:49:34.2341834Z [54/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86builder.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86builder.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86builder.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86builder.cpp 2025-05-07T19:49:34.2835952Z [55/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86emithelper.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86emithelper.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86emithelper.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86emithelper.cpp 2025-05-07T19:49:34.3071625Z [56/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86compiler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86compiler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86compiler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86compiler.cpp 2025-05-07T19:49:34.3569030Z [57/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86func.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86func.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86func.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86func.cpp 2025-05-07T19:49:35.6768124Z [58/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtils.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtils.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtils.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtils.cc 2025-05-07T19:49:35.7195597Z [59/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86assembler.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86assembler.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86assembler.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86assembler.cpp 2025-05-07T19:49:35.9704657Z [60/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dasmjit_EXPORTS -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86rapass.cpp.o -MF CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86rapass.cpp.o.d -o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86rapass.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86rapass.cpp 2025-05-07T19:49:36.4213639Z [61/155] : && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -fPIC -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -s -shared -Wl,-soname,asmjit.so -o asmjit.so CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64assembler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64builder.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64compiler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64emithelper.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64formatter.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64func.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instapi.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64instdb.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64operand.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/a64rapass.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/arm/armformatter.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/archtraits.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/assembler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/builder.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codeholder.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/codewriter.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/compiler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/constpool.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/cpuinfo.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emithelper.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitter.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/emitterutils.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/environment.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/errorhandler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/formatter.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/func.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/funcargscontext.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/globals.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/inst.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/instdb.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitallocator.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/jitruntime.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/logger.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/operand.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/osutils.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/ralocal.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rapass.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/rastack.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/string.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/support.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/target.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/type.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/virtmem.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zone.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonehash.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonelist.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonestack.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonetree.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/core/zonevector.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86assembler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86builder.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86compiler.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86emithelper.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86formatter.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86func.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instapi.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86instdb.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86operand.cpp.o CMakeFiles/asmjit.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/external/asmjit/src/asmjit/x86/x86rapass.cpp.o -L/lib/intel64 -L/lib/intel64_win -L/lib/win-x64 -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs: /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libnvrtc.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 /usr/lib64/libcuda.so /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so" -Wl,--as-needed /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libcudart.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so" -Wl,--as-needed && : 2025-05-07T19:49:37.2728862Z [62/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAvx2.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAvx2.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAvx2.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAvx2.cc 2025-05-07T19:49:38.2998146Z [63/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/Utils.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/Utils.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/Utils.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/Utils.cc 2025-05-07T19:49:38.3119462Z [64/155] cd /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build && bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../.github/scripts/fbgemm_gpu_postbuild.bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T19:49:38.3121942Z ################################################################################ 2025-05-07T19:49:38.3122439Z [CMAKE] Running post-build script ... 2025-05-07T19:49:38.3123241Z Target file: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T19:49:38.3124040Z Removing all RPATHs ... 2025-05-07T19:49:38.3124437Z ################################################################################ 2025-05-07T19:49:39.5248741Z [65/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/SparseAdagrad.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/SparseAdagrad.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/SparseAdagrad.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/SparseAdagrad.cc 2025-05-07T19:49:40.3210281Z [66/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RefImplementations.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RefImplementations.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RefImplementations.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RefImplementations.cc 2025-05-07T19:49:42.1068435Z [67/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RowWiseSparseAdagradFused.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RowWiseSparseAdagradFused.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RowWiseSparseAdagradFused.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RowWiseSparseAdagradFused.cc 2025-05-07T19:49:43.2590713Z [68/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAutovec.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAutovec.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAutovec.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAutovec.cc 2025-05-07T19:49:46.9795981Z [69/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/attention.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/attention.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/attention.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/attention/attention.cpp 2025-05-07T19:49:48.5680247Z [70/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/coalesce/coalesce.cpp 2025-05-07T19:49:51.5499041Z [71/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cpp 2025-05-07T19:49:51.6492672Z [72/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cpp 2025-05-07T19:49:53.0160900Z [73/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtilsAvx2.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtilsAvx2.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtilsAvx2.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtilsAvx2.cc 2025-05-07T19:49:53.6993359Z [74/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/comm/car.cpp 2025-05-07T19:49:55.4848713Z [75/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/moe/index_shuffling.cpp 2025-05-07T19:49:59.0939082Z [76/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cpp.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cpp.o.d -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cpp 2025-05-07T19:49:59.6593298Z [77/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMNBit.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMNBit.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMNBit.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMNBit.cc 2025-05-07T19:50:36.8113051Z [78/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDM.cc.o -MF CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDM.cc.o.d -o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDM.cc.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDM.cc 2025-05-07T19:50:37.4063534Z [79/155] : && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -fPIC -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -s -shared -Wl,-soname,fbgemm.so -o fbgemm.so CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDM.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAutovec.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMNBit.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtils.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RefImplementations.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/RowWiseSparseAdagradFused.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/SparseAdagrad.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/Utils.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/EmbeddingSpMDMAvx2.cc.o CMakeFiles/fbgemm.dir/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/src/QuantUtilsAvx2.cc.o -L/lib/intel64 -L/lib/intel64_win -L/lib/win-x64 -Wl,-rpath,"\$ORIGIN" /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libnvrtc.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 /usr/lib64/libcuda.so asmjit.so /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so" -Wl,--as-needed /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so" -Wl,--as-needed /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libcudart.so && : 2025-05-07T19:50:37.4336265Z [80/155] cd /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build && bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../.github/scripts/fbgemm_gpu_postbuild.bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 1 2025-05-07T19:50:37.4338251Z ################################################################################ 2025-05-07T19:50:37.4339222Z [CMAKE] Running post-build script ... 2025-05-07T19:50:37.4340064Z Target file: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T19:50:37.4340891Z Resetting RPATH to $ORIGIN ... 2025-05-07T19:50:37.4341414Z 0x000000000000000f (RPATH) Library rpath: [$ORIGIN] 2025-05-07T19:50:37.4341989Z ################################################################################ 2025-05-07T19:51:14.1582119Z [81/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/coalesce/coalesce.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cu.o 2025-05-07T19:51:14.1598979Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:51:29.8248310Z [82/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/gqa_attn_splitk.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/gqa_attn_splitk.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/attention/gqa_attn_splitk.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/gqa_attn_splitk.cu.o 2025-05-07T19:51:29.8265366Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:51:44.6212057Z [83/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cu.o 2025-05-07T19:51:44.6229275Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:51:44.6231609Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(202): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6233471Z static auto dtype() { 2025-05-07T19:51:44.6233836Z ^ 2025-05-07T19:51:44.6234028Z 2025-05-07T19:51:44.6234375Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:51:44.6234895Z 2025-05-07T19:51:44.6236489Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(195): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6237987Z static auto dtype() { 2025-05-07T19:51:44.6238323Z ^ 2025-05-07T19:51:44.6238504Z 2025-05-07T19:51:44.6239785Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(188): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6241356Z static auto dtype() { 2025-05-07T19:51:44.6241694Z ^ 2025-05-07T19:51:44.6241869Z 2025-05-07T19:51:44.6243129Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(202): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6244518Z static auto dtype() { 2025-05-07T19:51:44.6244881Z ^ 2025-05-07T19:51:44.6245063Z 2025-05-07T19:51:44.6245406Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:51:44.6245912Z 2025-05-07T19:51:44.6247015Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(195): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6248349Z static auto dtype() { 2025-05-07T19:51:44.6248665Z ^ 2025-05-07T19:51:44.6248843Z 2025-05-07T19:51:44.6250048Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(188): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6251547Z static auto dtype() { 2025-05-07T19:51:44.6251881Z ^ 2025-05-07T19:51:44.6252068Z 2025-05-07T19:51:44.6253241Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(202): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6254678Z static auto dtype() { 2025-05-07T19:51:44.6255031Z ^ 2025-05-07T19:51:44.6255211Z 2025-05-07T19:51:44.6255543Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:51:44.6256042Z 2025-05-07T19:51:44.6257221Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(195): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6258928Z static auto dtype() { 2025-05-07T19:51:44.6259271Z ^ 2025-05-07T19:51:44.6262100Z 2025-05-07T19:51:44.6263417Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(188): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6265045Z static auto dtype() { 2025-05-07T19:51:44.6265402Z ^ 2025-05-07T19:51:44.6265591Z 2025-05-07T19:51:44.6266768Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(202): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6268275Z static auto dtype() { 2025-05-07T19:51:44.6268625Z ^ 2025-05-07T19:51:44.6268818Z 2025-05-07T19:51:44.6269152Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:51:44.6269653Z 2025-05-07T19:51:44.6270818Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(195): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6272591Z static auto dtype() { 2025-05-07T19:51:44.6272964Z ^ 2025-05-07T19:51:44.6273153Z 2025-05-07T19:51:44.6274268Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(188): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6275784Z static auto dtype() { 2025-05-07T19:51:44.6276111Z ^ 2025-05-07T19:51:44.6276305Z 2025-05-07T19:51:44.6277490Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(202): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6278934Z static auto dtype() { 2025-05-07T19:51:44.6279269Z ^ 2025-05-07T19:51:44.6279443Z 2025-05-07T19:51:44.6279783Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:51:44.6280277Z 2025-05-07T19:51:44.6281447Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(195): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6282905Z static auto dtype() { 2025-05-07T19:51:44.6283239Z ^ 2025-05-07T19:51:44.6283421Z 2025-05-07T19:51:44.6284642Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(188): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6286724Z static auto dtype() { 2025-05-07T19:51:44.6287056Z ^ 2025-05-07T19:51:44.6287238Z 2025-05-07T19:51:44.6288373Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(202): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6289890Z static auto dtype() { 2025-05-07T19:51:44.6290229Z ^ 2025-05-07T19:51:44.6290416Z 2025-05-07T19:51:44.6290752Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:51:44.6291261Z 2025-05-07T19:51:44.6292482Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(195): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6294293Z static auto dtype() { 2025-05-07T19:51:44.6294641Z ^ 2025-05-07T19:51:44.6294821Z 2025-05-07T19:51:44.6296057Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/gather_scatter/gather_scatter.cu(188): warning #177-D: function "fbgemm_gpu::::TorchDTypeTrait::dtype" was declared but never referenced 2025-05-07T19:51:44.6297587Z static auto dtype() { 2025-05-07T19:51:44.6297926Z ^ 2025-05-07T19:51:44.6298105Z 2025-05-07T19:52:28.6578935Z [84/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/moe/index_shuffling.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cu.o 2025-05-07T19:52:28.6595258Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:52:45.1936391Z [85/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cu.o 2025-05-07T19:52:45.1951708Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:52:45.1954138Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(147): warning #177-D: variable "fbgemm_gpu::CVT_FP4_ELTS_PER_THREAD" was declared but never referenced 2025-05-07T19:52:45.1955443Z constexpr int CVT_FP4_ELTS_PER_THREAD = 8; 2025-05-07T19:52:45.1955816Z ^ 2025-05-07T19:52:45.1956009Z 2025-05-07T19:52:45.1956335Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:52:45.1956786Z 2025-05-07T19:52:45.1957680Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(148): warning #177-D: variable "fbgemm_gpu::CVT_FP4_SF_VEC_SIZE" was declared but never referenced 2025-05-07T19:52:45.1958829Z constexpr int CVT_FP4_SF_VEC_SIZE = 16; 2025-05-07T19:52:45.1959227Z ^ 2025-05-07T19:52:45.1959406Z 2025-05-07T19:52:45.1960305Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(147): warning #177-D: variable "fbgemm_gpu::CVT_FP4_ELTS_PER_THREAD" was declared but never referenced 2025-05-07T19:52:45.1961554Z constexpr int CVT_FP4_ELTS_PER_THREAD = 8; 2025-05-07T19:52:45.1961982Z ^ 2025-05-07T19:52:45.1962161Z 2025-05-07T19:52:45.1962543Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:52:45.1963111Z 2025-05-07T19:52:45.1964142Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(148): warning #177-D: variable "fbgemm_gpu::CVT_FP4_SF_VEC_SIZE" was declared but never referenced 2025-05-07T19:52:45.1965500Z constexpr int CVT_FP4_SF_VEC_SIZE = 16; 2025-05-07T19:52:45.1965907Z ^ 2025-05-07T19:52:45.1966088Z 2025-05-07T19:52:45.1967108Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(147): warning #177-D: variable "fbgemm_gpu::CVT_FP4_ELTS_PER_THREAD" was declared but never referenced 2025-05-07T19:52:45.1968537Z constexpr int CVT_FP4_ELTS_PER_THREAD = 8; 2025-05-07T19:52:45.1968927Z ^ 2025-05-07T19:52:45.1969092Z 2025-05-07T19:52:45.1969401Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:52:45.1969834Z 2025-05-07T19:52:45.1970700Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(148): warning #177-D: variable "fbgemm_gpu::CVT_FP4_SF_VEC_SIZE" was declared but never referenced 2025-05-07T19:52:45.1971850Z constexpr int CVT_FP4_SF_VEC_SIZE = 16; 2025-05-07T19:52:45.1972220Z ^ 2025-05-07T19:52:45.1972404Z 2025-05-07T19:52:45.1973334Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(147): warning #177-D: variable "fbgemm_gpu::CVT_FP4_ELTS_PER_THREAD" was declared but never referenced 2025-05-07T19:52:45.1974518Z constexpr int CVT_FP4_ELTS_PER_THREAD = 8; 2025-05-07T19:52:45.1974916Z ^ 2025-05-07T19:52:45.1975086Z 2025-05-07T19:52:45.1975410Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:52:45.1975855Z 2025-05-07T19:52:45.1976744Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/quantize.cu(148): warning #177-D: variable "fbgemm_gpu::CVT_FP4_SF_VEC_SIZE" was declared but never referenced 2025-05-07T19:52:45.1977884Z constexpr int CVT_FP4_SF_VEC_SIZE = 16; 2025-05-07T19:52:45.1978427Z ^ 2025-05-07T19:52:45.1978599Z 2025-05-07T19:52:45.1979325Z ptxas warning : Value of threads per SM for entry _ZN10fbgemm_gpu15cvt_fp16_to_fp4I13__nv_bfloat16Lb0EEEviiPKT_PKfPjS7_ is out of range. .minnctapersm will be ignored 2025-05-07T19:52:45.1980899Z ptxas warning : Value of threads per SM for entry _ZN10fbgemm_gpu15cvt_fp16_to_fp4I13__nv_bfloat16Lb1EEEviiPKT_PKfPjS7_ is out of range. .minnctapersm will be ignored 2025-05-07T19:52:45.1982423Z ptxas warning : Value of threads per SM for entry _ZN10fbgemm_gpu15cvt_fp16_to_fp4I6__halfLb0EEEviiPKT_PKfPjS7_ is out of range. .minnctapersm will be ignored 2025-05-07T19:52:45.1984029Z ptxas warning : Value of threads per SM for entry _ZN10fbgemm_gpu15cvt_fp16_to_fp4I6__halfLb1EEEviiPKT_PKfPjS7_ is out of range. .minnctapersm will be ignored 2025-05-07T19:52:46.2092532Z [86/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/comm/car.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cu.o 2025-05-07T19:52:46.2107299Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:53:03.2766233Z [87/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/kv_cache/kv_cache.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cu.o 2025-05-07T19:53:03.2781802Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:53:43.2355806Z [88/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16.cu.o 2025-05-07T19:53:43.2374293Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:54:23.2441546Z [89/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu.o 2025-05-07T19:54:23.2460975Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:54:23.2463281Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:54:23.2464998Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:54:23.2465647Z ^ 2025-05-07T19:54:23.2465885Z 2025-05-07T19:54:23.2466193Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:54:23.2466706Z 2025-05-07T19:54:23.2467955Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:54:23.2469722Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:54:23.2470352Z ^ 2025-05-07T19:54:23.2470595Z 2025-05-07T19:55:16.3008723Z [90/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu.o 2025-05-07T19:55:16.3026889Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:55:16.3029214Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:55:16.3030923Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:55:16.3031559Z ^ 2025-05-07T19:55:16.3031964Z 2025-05-07T19:55:16.3032311Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:55:16.3032807Z 2025-05-07T19:55:16.3034114Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:55:16.3035872Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:55:16.3036484Z ^ 2025-05-07T19:55:16.3036712Z 2025-05-07T19:57:34.6655838Z [91/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu.o 2025-05-07T19:57:34.6682251Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:57:34.6684505Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:57:34.6686537Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:57:34.6687147Z ^ 2025-05-07T19:57:34.6687394Z 2025-05-07T19:57:34.6687728Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:57:34.6688246Z 2025-05-07T19:57:34.6689461Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:57:34.6691243Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:57:34.6691878Z ^ 2025-05-07T19:57:34.6692127Z 2025-05-07T19:58:05.0524502Z [92/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu.o 2025-05-07T19:58:05.0543113Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:58:05.0545544Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.0547248Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.0547867Z ^ 2025-05-07T19:58:05.0548119Z 2025-05-07T19:58:05.0548468Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.0548970Z 2025-05-07T19:58:05.0550222Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.0552118Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:58:05.0552725Z ^ 2025-05-07T19:58:05.0552965Z 2025-05-07T19:58:05.0554160Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.0555852Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.0556454Z ^ 2025-05-07T19:58:05.0556831Z detected during: 2025-05-07T19:58:05.0577617Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.0618014Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.0658621Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.0682032Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.0683783Z 2025-05-07T19:58:05.0684129Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.0684644Z 2025-05-07T19:58:05.0686478Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.0688136Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.0689005Z ^ 2025-05-07T19:58:05.0689322Z detected during: 2025-05-07T19:58:05.0709188Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:05.0749325Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.0789787Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.0830811Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.0854091Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.0855860Z 2025-05-07T19:58:05.0857089Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.0858811Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.0859424Z ^ 2025-05-07T19:58:05.0859789Z detected during: 2025-05-07T19:58:05.0881140Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.0921823Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.0962598Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.0986316Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.0988059Z 2025-05-07T19:58:05.0988406Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.0988926Z 2025-05-07T19:58:05.0990166Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.0991950Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.0992525Z ^ 2025-05-07T19:58:05.0992836Z detected during: 2025-05-07T19:58:05.1012770Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:05.1054725Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.1095306Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.1135949Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.1159284Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.1160992Z 2025-05-07T19:58:05.1162251Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.1163963Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.1164575Z ^ 2025-05-07T19:58:05.1164932Z detected during: 2025-05-07T19:58:05.1186528Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.1226096Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.1266728Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.1290219Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.1291999Z 2025-05-07T19:58:05.1292352Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.1292861Z 2025-05-07T19:58:05.1294090Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.1295769Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.1296339Z ^ 2025-05-07T19:58:05.1296646Z detected during: 2025-05-07T19:58:05.1316666Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:05.1357288Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.1397968Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.1438725Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.1461760Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.1463515Z 2025-05-07T19:58:05.1465477Z ptxas /tmp/tmpxft_00000ea0_00000000-9_f4f4bf16_128_128_4_1_1_f.compute_90.ptx, line 925; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:05.1469045Z ptxas /tmp/tmpxft_00000ea0_00000000-9_f4f4bf16_128_128_4_1_1_f.compute_90.ptx, line 937; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:05.1472847Z ptxas /tmp/tmpxft_00000ea0_00000000-9_f4f4bf16_128_128_4_1_1_f.compute_90.ptx, line 1076; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:05.1476373Z ptxas /tmp/tmpxft_00000ea0_00000000-9_f4f4bf16_128_128_4_1_1_f.compute_90.ptx, line 1088; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:05.1479434Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.1481202Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.1481833Z ^ 2025-05-07T19:58:05.1482191Z detected during: 2025-05-07T19:58:05.1503374Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.1542754Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.1583157Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.1606718Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.1608451Z 2025-05-07T19:58:05.1608804Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.1609307Z 2025-05-07T19:58:05.1610898Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.1612635Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.1613193Z ^ 2025-05-07T19:58:05.1613500Z detected during: 2025-05-07T19:58:05.1633177Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:05.1673740Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.1714346Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.1755145Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.1778291Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.1780028Z 2025-05-07T19:58:05.1781250Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.1783875Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.1784576Z ^ 2025-05-07T19:58:05.1785612Z detected during: 2025-05-07T19:58:05.1807042Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.1847934Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.1900884Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.1924040Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.1925792Z 2025-05-07T19:58:05.1926142Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.1926652Z 2025-05-07T19:58:05.1927898Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.1929560Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.1930128Z ^ 2025-05-07T19:58:05.1930441Z detected during: 2025-05-07T19:58:05.1950190Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:05.1991088Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.2032247Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.2073065Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.2096798Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.2098549Z 2025-05-07T19:58:05.2099790Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.2101539Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.2102174Z ^ 2025-05-07T19:58:05.2102529Z detected during: 2025-05-07T19:58:05.2123884Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.2163887Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.2205263Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.2228568Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.2230333Z 2025-05-07T19:58:05.2230687Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:05.2231191Z 2025-05-07T19:58:05.2232695Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:05.2234384Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:05.2234940Z ^ 2025-05-07T19:58:05.2235253Z detected during: 2025-05-07T19:58:05.2254846Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:05.2296417Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:05.2336760Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:05.2377134Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:05.2400930Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu 2025-05-07T19:58:05.2402708Z 2025-05-07T19:58:06.5472932Z [93/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu.o 2025-05-07T19:58:06.5491011Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:58:06.5493247Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.5494876Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.5495469Z ^ 2025-05-07T19:58:06.5495699Z 2025-05-07T19:58:06.5496027Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.5496512Z 2025-05-07T19:58:06.5497705Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.5499632Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:58:06.5500215Z ^ 2025-05-07T19:58:06.5500450Z 2025-05-07T19:58:06.5501641Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.5503306Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.5503902Z ^ 2025-05-07T19:58:06.5504259Z detected during: 2025-05-07T19:58:06.5524331Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.5563028Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.5602442Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.5624630Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.5626452Z 2025-05-07T19:58:06.5626788Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.5627256Z 2025-05-07T19:58:06.5628443Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.5630071Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.5630619Z ^ 2025-05-07T19:58:06.5630917Z detected during: 2025-05-07T19:58:06.5649966Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:06.5689291Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.5727851Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.5767105Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.5789587Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.5791246Z 2025-05-07T19:58:06.5792553Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.5794196Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.5794814Z ^ 2025-05-07T19:58:06.5795162Z detected during: 2025-05-07T19:58:06.5815434Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.5854014Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.5893453Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.5915899Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.5917570Z 2025-05-07T19:58:06.5917909Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.5918404Z 2025-05-07T19:58:06.5919599Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.5921416Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.5921984Z ^ 2025-05-07T19:58:06.5922280Z detected during: 2025-05-07T19:58:06.5940979Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:06.5979843Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6018568Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6057814Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6079863Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6081504Z 2025-05-07T19:58:06.6082749Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6084422Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6085367Z ^ 2025-05-07T19:58:06.6086004Z detected during: 2025-05-07T19:58:06.6105847Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6144089Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6182888Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6205566Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6207219Z 2025-05-07T19:58:06.6207573Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.6208072Z 2025-05-07T19:58:06.6209266Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6210891Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6211440Z ^ 2025-05-07T19:58:06.6211807Z detected during: 2025-05-07T19:58:06.6230682Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:06.6269745Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6308718Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6348202Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6372441Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6374084Z 2025-05-07T19:58:06.6375770Z ptxas /tmp/tmpxft_00000ea6_00000000-9_f4f4bf16_128_128_4_1_1_t.compute_90.ptx, line 925; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:06.6379164Z ptxas /tmp/tmpxft_00000ea6_00000000-9_f4f4bf16_128_128_4_1_1_t.compute_90.ptx, line 937; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:06.6382754Z ptxas /tmp/tmpxft_00000ea6_00000000-9_f4f4bf16_128_128_4_1_1_t.compute_90.ptx, line 1076; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:06.6386632Z ptxas /tmp/tmpxft_00000ea6_00000000-9_f4f4bf16_128_128_4_1_1_t.compute_90.ptx, line 1088; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:58:06.6389628Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6391307Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6392077Z ^ 2025-05-07T19:58:06.6392425Z detected during: 2025-05-07T19:58:06.6412612Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6451130Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6491243Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6513803Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6515480Z 2025-05-07T19:58:06.6515830Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.6516335Z 2025-05-07T19:58:06.6517517Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6519140Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6519692Z ^ 2025-05-07T19:58:06.6519993Z detected during: 2025-05-07T19:58:06.6539039Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:06.6578451Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6617035Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6656117Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6678335Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6679991Z 2025-05-07T19:58:06.6681188Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6682879Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6683479Z ^ 2025-05-07T19:58:06.6683835Z detected during: 2025-05-07T19:58:06.6705041Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6743509Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6782403Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6805308Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6806991Z 2025-05-07T19:58:06.6807329Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.6807828Z 2025-05-07T19:58:06.6809257Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6810897Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6811441Z ^ 2025-05-07T19:58:06.6811745Z detected during: 2025-05-07T19:58:06.6830663Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:06.6869454Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.6908357Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.6947492Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.6969839Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.6971509Z 2025-05-07T19:58:06.6972711Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.6974521Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.6975111Z ^ 2025-05-07T19:58:06.6975460Z detected during: 2025-05-07T19:58:06.6995867Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.7034091Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.7073209Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.7095762Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.7097414Z 2025-05-07T19:58:06.7097750Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:06.7098238Z 2025-05-07T19:58:06.7099455Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:06.7101068Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:06.7101606Z ^ 2025-05-07T19:58:06.7101898Z detected during: 2025-05-07T19:58:06.7121059Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=10, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T19:58:06.7160041Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:58:06.7198891Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:58:06.7237855Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:58:06.7260120Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=128, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu 2025-05-07T19:58:06.7261792Z 2025-05-07T19:58:35.9220664Z [94/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/bf16i4bf16.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16.cu.o 2025-05-07T19:58:35.9237845Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:58:35.9240194Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:35.9241935Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:58:35.9242936Z ^ 2025-05-07T19:58:35.9243173Z 2025-05-07T19:58:35.9243521Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:58:35.9244022Z 2025-05-07T19:58:35.9245316Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:58:35.9247068Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:58:35.9247689Z ^ 2025-05-07T19:58:35.9247922Z 2025-05-07T19:59:41.3933004Z [95/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu.o 2025-05-07T19:59:41.3950615Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T19:59:41.3953068Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.3954801Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.3955788Z ^ 2025-05-07T19:59:41.3956033Z 2025-05-07T19:59:41.3956377Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.3956904Z 2025-05-07T19:59:41.3958163Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.3959914Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T19:59:41.3960537Z ^ 2025-05-07T19:59:41.3960773Z 2025-05-07T19:59:41.3961989Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.3974826Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.3975530Z ^ 2025-05-07T19:59:41.3975900Z detected during: 2025-05-07T19:59:41.3997383Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:59:41.4036434Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:59:41.4076827Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:59:41.4100048Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:59:41.4101851Z 2025-05-07T19:59:41.4102196Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.4102717Z 2025-05-07T19:59:41.4103935Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.4105641Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.4106238Z ^ 2025-05-07T19:59:41.4106603Z detected during: 2025-05-07T19:59:41.4127289Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:59:41.4166606Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:59:41.4207236Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:59:41.4230598Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:59:41.4232530Z 2025-05-07T19:59:41.4232899Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.4233403Z 2025-05-07T19:59:41.4234609Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.4236374Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.4237020Z ^ 2025-05-07T19:59:41.4237394Z detected during: 2025-05-07T19:59:41.4257679Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:59:41.4297224Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:59:41.4337284Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:59:41.4360188Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:59:41.4361917Z 2025-05-07T19:59:41.4362302Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.4362835Z 2025-05-07T19:59:41.4364600Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 889; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4368179Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 896; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4372150Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 903; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4375662Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 910; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4379148Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 1044; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4382719Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 1051; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4386726Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 1058; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4390180Z ptxas /tmp/tmpxft_00000ecd_00000000-9_f4f4bf16_128_192_2_2_1_f.compute_90.ptx, line 1065; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T19:59:41.4393440Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.4395193Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.4395785Z ^ 2025-05-07T19:59:41.4396621Z detected during: 2025-05-07T19:59:41.4416512Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:59:41.4455791Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:59:41.4496634Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:59:41.4519803Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:59:41.4521535Z 2025-05-07T19:59:41.4521883Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.4522374Z 2025-05-07T19:59:41.4523597Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.4525371Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.4526024Z ^ 2025-05-07T19:59:41.4526393Z detected during: 2025-05-07T19:59:41.4547304Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:59:41.4585916Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:59:41.4625632Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:59:41.4648631Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:59:41.4650387Z 2025-05-07T19:59:41.4650762Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.4651287Z 2025-05-07T19:59:41.4652533Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T19:59:41.4654298Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T19:59:41.4654951Z ^ 2025-05-07T19:59:41.4655309Z detected during: 2025-05-07T19:59:41.4675802Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T19:59:41.4714587Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T19:59:41.4754598Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T19:59:41.4777324Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu 2025-05-07T19:59:41.4778990Z 2025-05-07T19:59:41.4779337Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T19:59:41.4779847Z 2025-05-07T20:00:23.4641146Z [96/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu.o 2025-05-07T20:00:23.4658594Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:00:23.4660966Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.4662664Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.4663284Z ^ 2025-05-07T20:00:23.4663551Z 2025-05-07T20:00:23.4663919Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.4664463Z 2025-05-07T20:00:23.4665719Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.4667491Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:00:23.4668184Z ^ 2025-05-07T20:00:23.4668424Z 2025-05-07T20:00:23.4669626Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.4671310Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.4672262Z ^ 2025-05-07T20:00:23.4672638Z detected during: 2025-05-07T20:00:23.4694159Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:23.4733870Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:23.4774214Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:23.4797837Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T20:00:23.4799584Z 2025-05-07T20:00:23.4799953Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.4800524Z 2025-05-07T20:00:23.4801737Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.4803473Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.4804104Z ^ 2025-05-07T20:00:23.4804493Z detected during: 2025-05-07T20:00:23.4825477Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:23.4866429Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:23.4907468Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:23.4930689Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T20:00:23.4932448Z 2025-05-07T20:00:23.4932814Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.4933336Z 2025-05-07T20:00:23.4934588Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.4936326Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.4936981Z ^ 2025-05-07T20:00:23.4937350Z detected during: 2025-05-07T20:00:23.4958690Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:23.4998530Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:23.5039320Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:23.5061959Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T20:00:23.5063686Z 2025-05-07T20:00:23.5064047Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.5064571Z 2025-05-07T20:00:23.5066378Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 889; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5070001Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 896; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5074030Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 903; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5077266Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 910; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5080766Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 1044; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5084681Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 1051; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5088734Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 1058; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5092253Z ptxas /tmp/tmpxft_00000ef7_00000000-9_f4f4bf16_128_192_2_2_1_t.compute_90.ptx, line 1065; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:23.5095343Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.5097048Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.5097673Z ^ 2025-05-07T20:00:23.5098036Z detected during: 2025-05-07T20:00:23.5119283Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:23.5158960Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:23.5199902Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:23.5223045Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T20:00:23.5224811Z 2025-05-07T20:00:23.5225167Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.5225689Z 2025-05-07T20:00:23.5226898Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.5228649Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.5229259Z ^ 2025-05-07T20:00:23.5229626Z detected during: 2025-05-07T20:00:23.5250560Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:23.5290054Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:23.5330200Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:23.5353701Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T20:00:23.5355491Z 2025-05-07T20:00:23.5355861Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.5356380Z 2025-05-07T20:00:23.5357582Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:23.5359404Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:23.5360068Z ^ 2025-05-07T20:00:23.5360806Z detected during: 2025-05-07T20:00:23.5381568Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:23.5422315Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:23.5462636Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:23.5486488Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu 2025-05-07T20:00:23.5488357Z 2025-05-07T20:00:23.5488727Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:23.5489276Z 2025-05-07T20:00:56.2096367Z [97/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu.o 2025-05-07T20:00:56.2113995Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:00:56.2116220Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2118184Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2118805Z ^ 2025-05-07T20:00:56.2119038Z 2025-05-07T20:00:56.2119363Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.2119845Z 2025-05-07T20:00:56.2121055Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2122770Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:00:56.2123357Z ^ 2025-05-07T20:00:56.2123587Z 2025-05-07T20:00:56.2124735Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2126404Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2127000Z ^ 2025-05-07T20:00:56.2127337Z detected during: 2025-05-07T20:00:56.2147213Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2185945Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2224386Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2246644Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2248708Z 2025-05-07T20:00:56.2249088Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.2249581Z 2025-05-07T20:00:56.2250769Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2252384Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2252915Z ^ 2025-05-07T20:00:56.2253226Z detected during: 2025-05-07T20:00:56.2272563Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:00:56.2311804Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2349800Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2389169Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2411243Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2412919Z 2025-05-07T20:00:56.2414095Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2415761Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2416369Z ^ 2025-05-07T20:00:56.2416729Z detected during: 2025-05-07T20:00:56.2436663Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2474441Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2513300Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2535287Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2536932Z 2025-05-07T20:00:56.2537276Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.2537759Z 2025-05-07T20:00:56.2538949Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2540581Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2541125Z ^ 2025-05-07T20:00:56.2541419Z detected during: 2025-05-07T20:00:56.2560550Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:00:56.2594949Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2619251Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2650633Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2664740Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2665823Z 2025-05-07T20:00:56.2666605Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2667725Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2668124Z ^ 2025-05-07T20:00:56.2668367Z detected during: 2025-05-07T20:00:56.2681204Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2706002Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2731462Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2754814Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2755896Z 2025-05-07T20:00:56.2756135Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.2756461Z 2025-05-07T20:00:56.2757348Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2758418Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2758784Z ^ 2025-05-07T20:00:56.2758987Z detected during: 2025-05-07T20:00:56.2771119Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:00:56.2796433Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2820667Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2845528Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2859669Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2860746Z 2025-05-07T20:00:56.2861844Z ptxas /tmp/tmpxft_00000f19_00000000-9_f4f4bf16_128_256_2_1_1_f.compute_90.ptx, line 835; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:56.2864058Z ptxas /tmp/tmpxft_00000f19_00000000-9_f4f4bf16_128_256_2_1_1_f.compute_90.ptx, line 848; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:56.2866358Z ptxas /tmp/tmpxft_00000f19_00000000-9_f4f4bf16_128_256_2_1_1_f.compute_90.ptx, line 988; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:56.2868571Z ptxas /tmp/tmpxft_00000f19_00000000-9_f4f4bf16_128_256_2_1_1_f.compute_90.ptx, line 1001; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:00:56.2870507Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2871594Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2872136Z ^ 2025-05-07T20:00:56.2872367Z detected during: 2025-05-07T20:00:56.2885636Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.2909855Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.2934532Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.2948573Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.2949652Z 2025-05-07T20:00:56.2949884Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.2950208Z 2025-05-07T20:00:56.2950983Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.2952155Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.2952516Z ^ 2025-05-07T20:00:56.2952735Z detected during: 2025-05-07T20:00:56.2964873Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:00:56.2990094Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.3014541Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.3039110Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.3053203Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.3054285Z 2025-05-07T20:00:56.3055136Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.3056226Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.3056621Z ^ 2025-05-07T20:00:56.3056857Z detected during: 2025-05-07T20:00:56.3069535Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.3094384Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.3119075Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.3133096Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.3134178Z 2025-05-07T20:00:56.3134403Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.3134726Z 2025-05-07T20:00:56.3135503Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.3136552Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.3136998Z ^ 2025-05-07T20:00:56.3137198Z detected during: 2025-05-07T20:00:56.3149302Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:00:56.3174140Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.3198790Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.3223456Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.3237789Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.3238891Z 2025-05-07T20:00:56.3239699Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.3240794Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.3241189Z ^ 2025-05-07T20:00:56.3241425Z detected during: 2025-05-07T20:00:56.3254131Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.3278419Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.3303398Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.3317629Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.3318707Z 2025-05-07T20:00:56.3318930Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:00:56.3319341Z 2025-05-07T20:00:56.3320113Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:00:56.3321171Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:00:56.3321533Z ^ 2025-05-07T20:00:56.3321737Z detected during: 2025-05-07T20:00:56.3333909Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:00:56.3358830Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:00:56.3383110Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:00:56.3408373Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:00:56.3422481Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu 2025-05-07T20:00:56.3423557Z 2025-05-07T20:01:47.6885421Z [98/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu.o 2025-05-07T20:01:47.6902414Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:01:47.6904594Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.6906570Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.6907181Z ^ 2025-05-07T20:01:47.6907406Z 2025-05-07T20:01:47.6907744Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.6908233Z 2025-05-07T20:01:47.6909416Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.6911062Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:01:47.6911610Z ^ 2025-05-07T20:01:47.6912029Z 2025-05-07T20:01:47.6913177Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.6914849Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.6915432Z ^ 2025-05-07T20:01:47.6915786Z detected during: 2025-05-07T20:01:47.6934987Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.6971867Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7010078Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7031756Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7033813Z 2025-05-07T20:01:47.7034153Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.7034619Z 2025-05-07T20:01:47.7035808Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7037441Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7037974Z ^ 2025-05-07T20:01:47.7038275Z detected during: 2025-05-07T20:01:47.7056807Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:01:47.7094949Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7131286Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7170556Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7193910Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7195569Z 2025-05-07T20:01:47.7196778Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7198409Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7199003Z ^ 2025-05-07T20:01:47.7199350Z detected during: 2025-05-07T20:01:47.7218771Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7255572Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7293929Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7315757Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7317407Z 2025-05-07T20:01:47.7317742Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.7318237Z 2025-05-07T20:01:47.7319416Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7321063Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7321616Z ^ 2025-05-07T20:01:47.7321918Z detected during: 2025-05-07T20:01:47.7340726Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:01:47.7378758Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7416571Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7454145Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7475658Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7477367Z 2025-05-07T20:01:47.7478543Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7480191Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7480788Z ^ 2025-05-07T20:01:47.7481150Z detected during: 2025-05-07T20:01:47.7501107Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7538350Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7575565Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7597450Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7599140Z 2025-05-07T20:01:47.7599472Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.7599953Z 2025-05-07T20:01:47.7601440Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7603059Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7603599Z ^ 2025-05-07T20:01:47.7603895Z detected during: 2025-05-07T20:01:47.7622164Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:01:47.7659999Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7697558Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7734844Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7756399Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7758039Z 2025-05-07T20:01:47.7759683Z ptxas /tmp/tmpxft_00000f3b_00000000-9_f4f4bf16_128_256_2_1_1_t.compute_90.ptx, line 835; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:01:47.7763087Z ptxas /tmp/tmpxft_00000f3b_00000000-9_f4f4bf16_128_256_2_1_1_t.compute_90.ptx, line 848; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:01:47.7766723Z ptxas /tmp/tmpxft_00000f3b_00000000-9_f4f4bf16_128_256_2_1_1_t.compute_90.ptx, line 988; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:01:47.7770157Z ptxas /tmp/tmpxft_00000f3b_00000000-9_f4f4bf16_128_256_2_1_1_t.compute_90.ptx, line 1001; warning : Advisory: '.multicast::cluster' modifier on instruction 'cp.async.bulk{.tensor}' should be used on .target 'sm_90a/sm_100a/sm_101a' instead of .target 'sm_90' as this feature is expected to have substantially reduced performance on some future architectures 2025-05-07T20:01:47.7773084Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7774762Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7775365Z ^ 2025-05-07T20:01:47.7775719Z detected during: 2025-05-07T20:01:47.7796392Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7833221Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.7870261Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.7892407Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.7894066Z 2025-05-07T20:01:47.7894405Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.7894890Z 2025-05-07T20:01:47.7896060Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.7897685Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.7898227Z ^ 2025-05-07T20:01:47.7898511Z detected during: 2025-05-07T20:01:47.7917126Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:01:47.7954867Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.7992330Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.8029528Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.8051199Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.8052851Z 2025-05-07T20:01:47.8054286Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.8055933Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.8056546Z ^ 2025-05-07T20:01:47.8056898Z detected during: 2025-05-07T20:01:47.8076434Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.8113724Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.8150973Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.8172397Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.8174070Z 2025-05-07T20:01:47.8174416Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.8174899Z 2025-05-07T20:01:47.8176063Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.8177725Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.8178533Z ^ 2025-05-07T20:01:47.8178833Z detected during: 2025-05-07T20:01:47.8197999Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:01:47.8235988Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.8272917Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.8310806Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.8332824Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.8334472Z 2025-05-07T20:01:47.8335671Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.8337342Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.8337954Z ^ 2025-05-07T20:01:47.8338303Z detected during: 2025-05-07T20:01:47.8357847Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.8394936Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.8431492Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.8453130Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.8454757Z 2025-05-07T20:01:47.8455088Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:01:47.8455784Z 2025-05-07T20:01:47.8456950Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:01:47.8458558Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:01:47.8459095Z ^ 2025-05-07T20:01:47.8459401Z detected during: 2025-05-07T20:01:47.8478233Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=4, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:01:47.8516792Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:01:47.8565055Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:01:47.8602877Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:01:47.8624157Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=128, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu 2025-05-07T20:01:47.8625808Z 2025-05-07T20:03:24.7898250Z [99/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu.o 2025-05-07T20:03:24.7916013Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:03:24.7918244Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.7920409Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.7921065Z ^ 2025-05-07T20:03:24.7921303Z 2025-05-07T20:03:24.7921645Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.7922150Z 2025-05-07T20:03:24.7923338Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.7925034Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:03:24.7925625Z ^ 2025-05-07T20:03:24.7925865Z 2025-05-07T20:03:24.7927063Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.7928716Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.7929309Z ^ 2025-05-07T20:03:24.7929656Z detected during: 2025-05-07T20:03:24.7949902Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.7988993Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8028604Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8051058Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8052760Z 2025-05-07T20:03:24.8053100Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.8053588Z 2025-05-07T20:03:24.8054731Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8056322Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8056866Z ^ 2025-05-07T20:03:24.8057165Z detected during: 2025-05-07T20:03:24.8076561Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:24.8116853Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.8155702Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8196355Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8220566Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8222244Z 2025-05-07T20:03:24.8223439Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8225110Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8225714Z ^ 2025-05-07T20:03:24.8226061Z detected during: 2025-05-07T20:03:24.8246893Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.8286151Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8325370Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8347610Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8349290Z 2025-05-07T20:03:24.8349612Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.8350095Z 2025-05-07T20:03:24.8351302Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8353172Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8353711Z ^ 2025-05-07T20:03:24.8354014Z detected during: 2025-05-07T20:03:24.8372854Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:24.8412737Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.8451589Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8491161Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8513517Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8515148Z 2025-05-07T20:03:24.8516363Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8517996Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8518596Z ^ 2025-05-07T20:03:24.8518959Z detected during: 2025-05-07T20:03:24.8538933Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.8576923Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8616432Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8638449Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8640133Z 2025-05-07T20:03:24.8640483Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.8640967Z 2025-05-07T20:03:24.8642133Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8643759Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8644317Z ^ 2025-05-07T20:03:24.8644607Z detected during: 2025-05-07T20:03:24.8663524Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:24.8703254Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.8741767Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8781014Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8804886Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8806611Z 2025-05-07T20:03:24.8807865Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8809579Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8810220Z ^ 2025-05-07T20:03:24.8810563Z detected during: 2025-05-07T20:03:24.8830802Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.8869720Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.8909729Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.8932472Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.8934374Z 2025-05-07T20:03:24.8934712Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.8935213Z 2025-05-07T20:03:24.8936421Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.8938090Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.8938639Z ^ 2025-05-07T20:03:24.8938942Z detected during: 2025-05-07T20:03:24.8958317Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:24.8998515Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.9038382Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.9077802Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.9100905Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.9102621Z 2025-05-07T20:03:24.9103831Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.9105542Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.9106161Z ^ 2025-05-07T20:03:24.9106520Z detected during: 2025-05-07T20:03:24.9127395Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.9166318Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.9206535Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.9228722Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.9230440Z 2025-05-07T20:03:24.9230778Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.9231284Z 2025-05-07T20:03:24.9232978Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.9234610Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.9235163Z ^ 2025-05-07T20:03:24.9235466Z detected during: 2025-05-07T20:03:24.9254591Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:24.9294667Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.9333553Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.9373053Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.9396220Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.9397903Z 2025-05-07T20:03:24.9399526Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.9401224Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.9401839Z ^ 2025-05-07T20:03:24.9402192Z detected during: 2025-05-07T20:03:24.9422640Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.9461311Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.9500528Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.9520825Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.9522622Z 2025-05-07T20:03:24.9522996Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:24.9523509Z 2025-05-07T20:03:24.9524791Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:24.9526531Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:24.9527094Z ^ 2025-05-07T20:03:24.9527414Z detected during: 2025-05-07T20:03:24.9548148Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:24.9591275Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:24.9630638Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:24.9668142Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:24.9689657Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu 2025-05-07T20:03:24.9691223Z 2025-05-07T20:03:55.3266019Z [100/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu.o 2025-05-07T20:03:55.3282937Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:03:55.3285799Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3287432Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.3288015Z ^ 2025-05-07T20:03:55.3288253Z 2025-05-07T20:03:55.3288580Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.3289064Z 2025-05-07T20:03:55.3290309Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3292025Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:03:55.3292589Z ^ 2025-05-07T20:03:55.3292823Z 2025-05-07T20:03:55.3293964Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3295525Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.3296399Z ^ 2025-05-07T20:03:55.3296774Z detected during: 2025-05-07T20:03:55.3316369Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.3353796Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.3392142Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.3414039Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.3415692Z 2025-05-07T20:03:55.3416023Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.3416506Z 2025-05-07T20:03:55.3417638Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3419445Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.3420031Z ^ 2025-05-07T20:03:55.3420330Z detected during: 2025-05-07T20:03:55.3438972Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:55.3476203Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.3513113Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.3551298Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.3573096Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.3574769Z 2025-05-07T20:03:55.3575943Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3577573Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.3578168Z ^ 2025-05-07T20:03:55.3578512Z detected during: 2025-05-07T20:03:55.3598942Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.3637921Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.3676196Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.3698474Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.3700174Z 2025-05-07T20:03:55.3700511Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.3700992Z 2025-05-07T20:03:55.3702173Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3703779Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.3704319Z ^ 2025-05-07T20:03:55.3704645Z detected during: 2025-05-07T20:03:55.3736932Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:55.3775061Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.3813397Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.3852171Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.3874498Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.3876164Z 2025-05-07T20:03:55.3877370Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.3879087Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.3879712Z ^ 2025-05-07T20:03:55.3880084Z detected during: 2025-05-07T20:03:55.3900380Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.3938566Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.3977147Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.3999659Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4001297Z 2025-05-07T20:03:55.4001635Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.4002156Z 2025-05-07T20:03:55.4003328Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4004960Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4005511Z ^ 2025-05-07T20:03:55.4005822Z detected during: 2025-05-07T20:03:55.4024532Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:55.4063356Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4102038Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.4140541Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.4175108Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4176790Z 2025-05-07T20:03:55.4177965Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4179690Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4180292Z ^ 2025-05-07T20:03:55.4180644Z detected during: 2025-05-07T20:03:55.4201402Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4239206Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.4279518Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.4302147Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4303873Z 2025-05-07T20:03:55.4304222Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.4304743Z 2025-05-07T20:03:55.4305929Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4307588Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4308145Z ^ 2025-05-07T20:03:55.4308449Z detected during: 2025-05-07T20:03:55.4327771Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:55.4366864Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4405268Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.4443814Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.4465895Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4467545Z 2025-05-07T20:03:55.4468717Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4470374Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4470971Z ^ 2025-05-07T20:03:55.4471324Z detected during: 2025-05-07T20:03:55.4491842Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4529394Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.4567798Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.4590270Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4592449Z 2025-05-07T20:03:55.4592799Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.4593290Z 2025-05-07T20:03:55.4594501Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4596129Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4596668Z ^ 2025-05-07T20:03:55.4596971Z detected during: 2025-05-07T20:03:55.4615803Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:55.4654418Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4693221Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.4732026Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.4754216Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4755779Z 2025-05-07T20:03:55.4756909Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4758534Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4759144Z ^ 2025-05-07T20:03:55.4759494Z detected during: 2025-05-07T20:03:55.4779340Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4818278Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.4857463Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.4879633Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.4881336Z 2025-05-07T20:03:55.4881693Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:55.4882508Z 2025-05-07T20:03:55.4883707Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:55.4885934Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:55.4886509Z ^ 2025-05-07T20:03:55.4886803Z detected during: 2025-05-07T20:03:55.4909129Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:55.4947748Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:55.4986177Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:55.5024997Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:55.5047188Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu 2025-05-07T20:03:55.5049038Z 2025-05-07T20:03:56.4982002Z [101/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu.o 2025-05-07T20:03:56.4999616Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:03:56.5001849Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5003515Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5004119Z ^ 2025-05-07T20:03:56.5004348Z 2025-05-07T20:03:56.5004688Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.5005185Z 2025-05-07T20:03:56.5006388Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5008030Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:03:56.5008631Z ^ 2025-05-07T20:03:56.5009215Z 2025-05-07T20:03:56.5010423Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5012233Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5012830Z ^ 2025-05-07T20:03:56.5013182Z detected during: 2025-05-07T20:03:56.5033744Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5072464Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5111602Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5133943Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5135635Z 2025-05-07T20:03:56.5135987Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.5136490Z 2025-05-07T20:03:56.5137663Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5139275Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5139838Z ^ 2025-05-07T20:03:56.5140142Z detected during: 2025-05-07T20:03:56.5159591Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:56.5198764Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5237246Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5277861Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5300206Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5301905Z 2025-05-07T20:03:56.5303080Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5305087Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5305686Z ^ 2025-05-07T20:03:56.5306047Z detected during: 2025-05-07T20:03:56.5326088Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5364075Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5402892Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5425015Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5426690Z 2025-05-07T20:03:56.5427037Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.5427524Z 2025-05-07T20:03:56.5428734Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5430498Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5431173Z ^ 2025-05-07T20:03:56.5431470Z detected during: 2025-05-07T20:03:56.5450490Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:56.5490101Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5528195Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5567392Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5590293Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5592151Z 2025-05-07T20:03:56.5593340Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5595006Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5595603Z ^ 2025-05-07T20:03:56.5595960Z detected during: 2025-05-07T20:03:56.5616267Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5655358Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5694935Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5717850Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5719537Z 2025-05-07T20:03:56.5719899Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.5720397Z 2025-05-07T20:03:56.5721598Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5723209Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5723765Z ^ 2025-05-07T20:03:56.5724057Z detected during: 2025-05-07T20:03:56.5743090Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:56.5782456Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5809436Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5854839Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5869423Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5870513Z 2025-05-07T20:03:56.5871296Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5872563Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5872961Z ^ 2025-05-07T20:03:56.5873260Z detected during: 2025-05-07T20:03:56.5887222Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.5912145Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.5937445Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.5952070Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.5953161Z 2025-05-07T20:03:56.5953389Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.5953718Z 2025-05-07T20:03:56.5954504Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.5955564Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.5955937Z ^ 2025-05-07T20:03:56.5956144Z detected during: 2025-05-07T20:03:56.5968380Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:56.5994066Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.6018900Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.6044201Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.6058702Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.6059777Z 2025-05-07T20:03:56.6060571Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.6061654Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.6062065Z ^ 2025-05-07T20:03:56.6062302Z detected during: 2025-05-07T20:03:56.6075389Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.6100895Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.6126219Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.6140707Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.6141804Z 2025-05-07T20:03:56.6142028Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.6142358Z 2025-05-07T20:03:56.6143133Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.6144193Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.6144554Z ^ 2025-05-07T20:03:56.6144766Z detected during: 2025-05-07T20:03:56.6157065Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:56.6182253Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.6207903Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.6233424Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.6247833Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.6248911Z 2025-05-07T20:03:56.6249727Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.6250816Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.6251210Z ^ 2025-05-07T20:03:56.6251449Z detected during: 2025-05-07T20:03:56.6264394Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.6291277Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.6316595Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.6331018Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.6332141Z 2025-05-07T20:03:56.6332367Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:03:56.6332693Z 2025-05-07T20:03:56.6333475Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:03:56.6334531Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:03:56.6334893Z ^ 2025-05-07T20:03:56.6335103Z detected during: 2025-05-07T20:03:56.6347313Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:03:56.6372702Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:03:56.6398107Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:03:56.6423413Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:03:56.6437967Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu 2025-05-07T20:03:56.6439045Z 2025-05-07T20:04:28.3135984Z [102/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu.o 2025-05-07T20:04:28.3153859Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:04:28.3156193Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3157757Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3158330Z ^ 2025-05-07T20:04:28.3158570Z 2025-05-07T20:04:28.3158879Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.3159393Z 2025-05-07T20:04:28.3160673Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3162402Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:04:28.3162948Z ^ 2025-05-07T20:04:28.3163171Z 2025-05-07T20:04:28.3164317Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3166043Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3166677Z ^ 2025-05-07T20:04:28.3167048Z detected during: 2025-05-07T20:04:28.3188360Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.3228613Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.3268860Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.3292366Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.3294126Z 2025-05-07T20:04:28.3294481Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.3294999Z 2025-05-07T20:04:28.3296398Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3298079Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3298639Z ^ 2025-05-07T20:04:28.3298955Z detected during: 2025-05-07T20:04:28.3318167Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:04:28.3358460Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.3398403Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.3439428Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.3462285Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.3464193Z 2025-05-07T20:04:28.3465431Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3467147Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3467786Z ^ 2025-05-07T20:04:28.3468161Z detected during: 2025-05-07T20:04:28.3489241Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.3529282Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.3569387Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.3592537Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.3594257Z 2025-05-07T20:04:28.3594605Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.3595110Z 2025-05-07T20:04:28.3596312Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3597953Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3598510Z ^ 2025-05-07T20:04:28.3598823Z detected during: 2025-05-07T20:04:28.3619513Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:04:28.3660477Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.3702142Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.3742636Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.3766680Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.3768514Z 2025-05-07T20:04:28.3769934Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3771747Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3772403Z ^ 2025-05-07T20:04:28.3772778Z detected during: 2025-05-07T20:04:28.3794549Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.3833582Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.3872826Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.3895753Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.3897727Z 2025-05-07T20:04:28.3898085Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.3898603Z 2025-05-07T20:04:28.3899790Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.3901668Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.3902274Z ^ 2025-05-07T20:04:28.3902609Z detected during: 2025-05-07T20:04:28.3922308Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:04:28.3962459Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4002798Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4043333Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4066416Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4068138Z 2025-05-07T20:04:28.4069358Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.4071150Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.4071925Z ^ 2025-05-07T20:04:28.4072296Z detected during: 2025-05-07T20:04:28.4093433Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4132889Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4173198Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4196448Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4198179Z 2025-05-07T20:04:28.4198551Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.4199066Z 2025-05-07T20:04:28.4200548Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.4202249Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.4202828Z ^ 2025-05-07T20:04:28.4203143Z detected during: 2025-05-07T20:04:28.4222719Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:04:28.4263009Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4303424Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4343863Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4367167Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4368927Z 2025-05-07T20:04:28.4370180Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.4372029Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.4372758Z ^ 2025-05-07T20:04:28.4373136Z detected during: 2025-05-07T20:04:28.4394409Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4433820Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4474070Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4497173Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4498920Z 2025-05-07T20:04:28.4499268Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.4499776Z 2025-05-07T20:04:28.4501018Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.4502702Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.4503275Z ^ 2025-05-07T20:04:28.4503579Z detected during: 2025-05-07T20:04:28.4523187Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:04:28.4563622Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4603017Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4643284Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4666270Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4667976Z 2025-05-07T20:04:28.4669199Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.4670950Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.4671582Z ^ 2025-05-07T20:04:28.4672111Z detected during: 2025-05-07T20:04:28.4693678Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4733105Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4772529Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4795080Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4796762Z 2025-05-07T20:04:28.4797141Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:04:28.4797637Z 2025-05-07T20:04:28.4798852Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:04:28.4800550Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:04:28.4801441Z ^ 2025-05-07T20:04:28.4801751Z detected during: 2025-05-07T20:04:28.4820970Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=13, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=2, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<128>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:04:28.4860701Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:04:28.4899785Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:04:28.4938983Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b4x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:04:28.4961684Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=128, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu 2025-05-07T20:04:28.4963339Z 2025-05-07T20:06:25.9534467Z [103/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu.o 2025-05-07T20:06:25.9552599Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:06:25.9554918Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:25.9556867Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:25.9557517Z ^ 2025-05-07T20:06:25.9557751Z 2025-05-07T20:06:25.9558088Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:25.9558595Z 2025-05-07T20:06:25.9559814Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:25.9561824Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:06:25.9562468Z ^ 2025-05-07T20:06:25.9562723Z 2025-05-07T20:06:25.9563951Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:25.9565680Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:25.9566311Z ^ 2025-05-07T20:06:25.9566683Z detected during: 2025-05-07T20:06:25.9587684Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:06:25.9627026Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:06:25.9666534Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:06:25.9689885Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T20:06:25.9691589Z 2025-05-07T20:06:25.9691954Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:25.9692471Z 2025-05-07T20:06:25.9693685Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:25.9695450Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:25.9696108Z ^ 2025-05-07T20:06:25.9696469Z detected during: 2025-05-07T20:06:25.9717804Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:06:25.9757348Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:06:25.9797229Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:06:25.9819397Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T20:06:25.9821103Z 2025-05-07T20:06:25.9821450Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:25.9821978Z 2025-05-07T20:06:25.9823176Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:25.9824873Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:25.9825471Z ^ 2025-05-07T20:06:25.9825841Z detected during: 2025-05-07T20:06:25.9846408Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:06:25.9885696Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:06:25.9926070Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:06:25.9949051Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T20:06:25.9950818Z 2025-05-07T20:06:25.9951171Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:25.9951687Z 2025-05-07T20:06:25.9953007Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:25.9954762Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:25.9955380Z ^ 2025-05-07T20:06:25.9955752Z detected during: 2025-05-07T20:06:25.9976353Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:06:26.0015781Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:06:26.0055799Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:06:26.0079008Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T20:06:26.0080700Z 2025-05-07T20:06:26.0081038Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:26.0081562Z 2025-05-07T20:06:26.0082804Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:26.0084537Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:26.0085522Z ^ 2025-05-07T20:06:26.0085908Z detected during: 2025-05-07T20:06:26.0106867Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:06:26.0147213Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:06:26.0188173Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:06:26.0210751Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T20:06:26.0212514Z 2025-05-07T20:06:26.0212886Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:26.0213410Z 2025-05-07T20:06:26.0214669Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:06:26.0216386Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:06:26.0217025Z ^ 2025-05-07T20:06:26.0217583Z detected during: 2025-05-07T20:06:26.0238528Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:06:26.0277597Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:06:26.0318202Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:06:26.0341242Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu 2025-05-07T20:06:26.0342985Z 2025-05-07T20:06:26.0343333Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:06:26.0343827Z 2025-05-07T20:07:07.7959742Z [104/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu.o 2025-05-07T20:07:07.7977982Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:07:07.7980245Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.7982153Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.7982823Z ^ 2025-05-07T20:07:07.7983075Z 2025-05-07T20:07:07.7983427Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.7983955Z 2025-05-07T20:07:07.7985888Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.7987695Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:07:07.7988330Z ^ 2025-05-07T20:07:07.7988580Z 2025-05-07T20:07:07.7989822Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.7991555Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.7992333Z ^ 2025-05-07T20:07:07.7992712Z detected during: 2025-05-07T20:07:07.8012940Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:07.8052270Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:07.8092763Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:07.8115966Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T20:07:07.8117665Z 2025-05-07T20:07:07.8118013Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.8118523Z 2025-05-07T20:07:07.8119723Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.8121432Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.8122031Z ^ 2025-05-07T20:07:07.8122389Z detected during: 2025-05-07T20:07:07.8143036Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:07.8182622Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:07.8223242Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:07.8246632Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T20:07:07.8248370Z 2025-05-07T20:07:07.8248730Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.8249252Z 2025-05-07T20:07:07.8250470Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.8252201Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.8252860Z ^ 2025-05-07T20:07:07.8253417Z detected during: 2025-05-07T20:07:07.8273924Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:07.8313493Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:07.8353739Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:07.8376850Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T20:07:07.8378587Z 2025-05-07T20:07:07.8378942Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.8379475Z 2025-05-07T20:07:07.8380717Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.8382478Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.8383112Z ^ 2025-05-07T20:07:07.8383471Z detected during: 2025-05-07T20:07:07.8404788Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:07.8443917Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:07.8486131Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:07.8509259Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T20:07:07.8510963Z 2025-05-07T20:07:07.8511314Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.8511826Z 2025-05-07T20:07:07.8513272Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.8515006Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.8515632Z ^ 2025-05-07T20:07:07.8515951Z detected during: 2025-05-07T20:07:07.8536415Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:07.8573223Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:07.8613164Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:07.8636188Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T20:07:07.8637966Z 2025-05-07T20:07:07.8638327Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.8638840Z 2025-05-07T20:07:07.8640065Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:07.8641786Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:07.8642427Z ^ 2025-05-07T20:07:07.8643010Z detected during: 2025-05-07T20:07:07.8663431Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:07.8703448Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:07.8742990Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:07.8765671Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu 2025-05-07T20:07:07.8767391Z 2025-05-07T20:07:07.8767747Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:07.8768232Z 2025-05-07T20:07:39.6185574Z [105/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu.o 2025-05-07T20:07:39.6201556Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:07:39.6203523Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6205614Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6206165Z ^ 2025-05-07T20:07:39.6206369Z 2025-05-07T20:07:39.6206793Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6207230Z 2025-05-07T20:07:39.6208683Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6210183Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:07:39.6210706Z ^ 2025-05-07T20:07:39.6210910Z 2025-05-07T20:07:39.6211941Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6213410Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6213929Z ^ 2025-05-07T20:07:39.6214233Z detected during: 2025-05-07T20:07:39.6232578Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:39.6266680Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:39.6302603Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:39.6323064Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T20:07:39.6324589Z 2025-05-07T20:07:39.6324905Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6325337Z 2025-05-07T20:07:39.6326408Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6327915Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6328448Z ^ 2025-05-07T20:07:39.6328750Z detected during: 2025-05-07T20:07:39.6347064Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:39.6382236Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:39.6418043Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:39.6438473Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T20:07:39.6440025Z 2025-05-07T20:07:39.6441873Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6442310Z 2025-05-07T20:07:39.6443365Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6444878Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6445635Z ^ 2025-05-07T20:07:39.6445961Z detected during: 2025-05-07T20:07:39.6463805Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:39.6493186Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:39.6519496Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:39.6535645Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T20:07:39.6536765Z 2025-05-07T20:07:39.6536992Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6537327Z 2025-05-07T20:07:39.6538112Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6539214Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6539625Z ^ 2025-05-07T20:07:39.6539864Z detected during: 2025-05-07T20:07:39.6554943Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:39.6582014Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:39.6608696Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:39.6624337Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T20:07:39.6625509Z 2025-05-07T20:07:39.6625812Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6626154Z 2025-05-07T20:07:39.6626948Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6628098Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6628553Z ^ 2025-05-07T20:07:39.6628797Z detected during: 2025-05-07T20:07:39.6642343Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:39.6669807Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:39.6696507Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:39.6720613Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T20:07:39.6721828Z 2025-05-07T20:07:39.6722077Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6722414Z 2025-05-07T20:07:39.6723197Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:07:39.6724452Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:07:39.6725029Z ^ 2025-05-07T20:07:39.6725274Z detected during: 2025-05-07T20:07:39.6738797Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:07:39.6765451Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:07:39.6793293Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:07:39.6809200Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu 2025-05-07T20:07:39.6810314Z 2025-05-07T20:07:39.6810541Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:07:39.6810865Z 2025-05-07T20:08:31.7814094Z [106/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu.o 2025-05-07T20:08:31.7832316Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:08:31.7835727Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.7837531Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.7838169Z ^ 2025-05-07T20:08:31.7838399Z 2025-05-07T20:08:31.7838724Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.7839211Z 2025-05-07T20:08:31.7840632Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.7842382Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:08:31.7843060Z ^ 2025-05-07T20:08:31.7843316Z 2025-05-07T20:08:31.7844566Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.7846286Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.7846913Z ^ 2025-05-07T20:08:31.7847274Z detected during: 2025-05-07T20:08:31.7867859Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:08:31.7906648Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:08:31.7946615Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:08:31.7969053Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T20:08:31.7970756Z 2025-05-07T20:08:31.7971119Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.7971655Z 2025-05-07T20:08:31.7972882Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.7974624Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.7975251Z ^ 2025-05-07T20:08:31.7975627Z detected during: 2025-05-07T20:08:31.7997085Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:08:31.8036771Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:08:31.8076708Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:08:31.8099774Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T20:08:31.8101534Z 2025-05-07T20:08:31.8101887Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.8102396Z 2025-05-07T20:08:31.8103607Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.8105356Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.8106336Z ^ 2025-05-07T20:08:31.8106695Z detected during: 2025-05-07T20:08:31.8127965Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:08:31.8166301Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:08:31.8205535Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:08:31.8227725Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T20:08:31.8229469Z 2025-05-07T20:08:31.8229827Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.8230329Z 2025-05-07T20:08:31.8231532Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.8233506Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.8234155Z ^ 2025-05-07T20:08:31.8234526Z detected during: 2025-05-07T20:08:31.8254907Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:08:31.8294667Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:08:31.8334476Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:08:31.8357555Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T20:08:31.8359221Z 2025-05-07T20:08:31.8359585Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.8360095Z 2025-05-07T20:08:31.8361315Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.8363009Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.8363616Z ^ 2025-05-07T20:08:31.8364196Z detected during: 2025-05-07T20:08:31.8385866Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:08:31.8425109Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:08:31.8465199Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:08:31.8488648Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T20:08:31.8490350Z 2025-05-07T20:08:31.8490718Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.8491228Z 2025-05-07T20:08:31.8492465Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:08:31.8494499Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:08:31.8495187Z ^ 2025-05-07T20:08:31.8495750Z detected during: 2025-05-07T20:08:31.8516684Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:08:31.8555674Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:08:31.8596126Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:08:31.8619260Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu 2025-05-07T20:08:31.8620957Z 2025-05-07T20:08:31.8621304Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:08:31.8621813Z 2025-05-07T20:10:07.1196132Z [107/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu.o 2025-05-07T20:10:07.1213921Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:10:07.1216591Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1218333Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1218962Z ^ 2025-05-07T20:10:07.1219199Z 2025-05-07T20:10:07.1219538Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.1220047Z 2025-05-07T20:10:07.1221492Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1223237Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:10:07.1223866Z ^ 2025-05-07T20:10:07.1224094Z 2025-05-07T20:10:07.1225345Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1227053Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1227697Z ^ 2025-05-07T20:10:07.1228074Z detected during: 2025-05-07T20:10:07.1248442Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:07.1288006Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:07.1328209Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:07.1350923Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T20:10:07.1352893Z 2025-05-07T20:10:07.1353272Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.1353808Z 2025-05-07T20:10:07.1355021Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1356766Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1357363Z ^ 2025-05-07T20:10:07.1357899Z detected during: 2025-05-07T20:10:07.1378145Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:07.1417949Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:07.1459529Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:07.1482646Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T20:10:07.1484411Z 2025-05-07T20:10:07.1485112Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.1485662Z 2025-05-07T20:10:07.1486887Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1488875Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1489511Z ^ 2025-05-07T20:10:07.1489879Z detected during: 2025-05-07T20:10:07.1510634Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:07.1549485Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:07.1589908Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:07.1613074Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T20:10:07.1614741Z 2025-05-07T20:10:07.1615073Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.1615548Z 2025-05-07T20:10:07.1616787Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1618516Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1619133Z ^ 2025-05-07T20:10:07.1619483Z detected during: 2025-05-07T20:10:07.1640071Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:07.1679109Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:07.1719087Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:07.1742164Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T20:10:07.1743867Z 2025-05-07T20:10:07.1744233Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.1744752Z 2025-05-07T20:10:07.1745939Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1747609Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1748156Z ^ 2025-05-07T20:10:07.1748628Z detected during: 2025-05-07T20:10:07.1769115Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:07.1808429Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:07.1848321Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:07.1870774Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T20:10:07.1872581Z 2025-05-07T20:10:07.1872933Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.1873432Z 2025-05-07T20:10:07.1874713Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:07.1876461Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:07.1877278Z ^ 2025-05-07T20:10:07.1877646Z detected during: 2025-05-07T20:10:07.1898195Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:07.1937111Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:07.1976487Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:07.1999680Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu 2025-05-07T20:10:07.2001450Z 2025-05-07T20:10:07.2001805Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:07.2002306Z 2025-05-07T20:10:13.5157232Z [108/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu.o 2025-05-07T20:10:13.5175190Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:10:13.5177831Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5179543Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5180175Z ^ 2025-05-07T20:10:13.5180410Z 2025-05-07T20:10:13.5180751Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.5181265Z 2025-05-07T20:10:13.5182679Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5184483Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:10:13.5185783Z ^ 2025-05-07T20:10:13.5186018Z 2025-05-07T20:10:13.5187275Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5189018Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5189642Z ^ 2025-05-07T20:10:13.5191554Z detected during: 2025-05-07T20:10:13.5211889Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.5249692Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.5289770Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.5312250Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.5314223Z 2025-05-07T20:10:13.5314585Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.5315065Z 2025-05-07T20:10:13.5316244Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5317922Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5318522Z ^ 2025-05-07T20:10:13.5318843Z detected during: 2025-05-07T20:10:13.5338492Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:13.5377220Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.5415647Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.5454825Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.5477775Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.5479461Z 2025-05-07T20:10:13.5480687Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5482490Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5483254Z ^ 2025-05-07T20:10:13.5494305Z detected during: 2025-05-07T20:10:13.5514662Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.5555171Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.5594864Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.5617295Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.5619041Z 2025-05-07T20:10:13.5619380Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.5619892Z 2025-05-07T20:10:13.5621063Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5622860Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5623367Z ^ 2025-05-07T20:10:13.5623642Z detected during: 2025-05-07T20:10:13.5642937Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:13.5682116Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.5721173Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.5760970Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.5782777Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.5784461Z 2025-05-07T20:10:13.5786410Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5788110Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5788760Z ^ 2025-05-07T20:10:13.5789125Z detected during: 2025-05-07T20:10:13.5809103Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.5846727Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.5886269Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.5908869Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.5910639Z 2025-05-07T20:10:13.5911004Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.5911522Z 2025-05-07T20:10:13.5912978Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.5914652Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.5915211Z ^ 2025-05-07T20:10:13.5915528Z detected during: 2025-05-07T20:10:13.5934452Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:13.5973091Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6011811Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6050725Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6073729Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6075506Z 2025-05-07T20:10:13.6076749Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.6078661Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.6079260Z ^ 2025-05-07T20:10:13.6079620Z detected during: 2025-05-07T20:10:13.6099556Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6138308Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6178057Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6201657Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6203479Z 2025-05-07T20:10:13.6203821Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.6204338Z 2025-05-07T20:10:13.6205539Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.6207247Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.6207826Z ^ 2025-05-07T20:10:13.6208138Z detected during: 2025-05-07T20:10:13.6227899Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:13.6267059Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6305411Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6344733Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6367703Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6369636Z 2025-05-07T20:10:13.6370891Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.6372646Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.6373291Z ^ 2025-05-07T20:10:13.6373664Z detected during: 2025-05-07T20:10:13.6394558Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6432975Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6471671Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6494908Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6496620Z 2025-05-07T20:10:13.6496965Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.6497469Z 2025-05-07T20:10:13.6498669Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.6500351Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.6500931Z ^ 2025-05-07T20:10:13.6501239Z detected during: 2025-05-07T20:10:13.6521279Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:13.6562250Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6601087Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6640668Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6663276Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6665060Z 2025-05-07T20:10:13.6666317Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.6668063Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.6668712Z ^ 2025-05-07T20:10:13.6669075Z detected during: 2025-05-07T20:10:13.6689983Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6727729Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6767009Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6789666Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6791384Z 2025-05-07T20:10:13.6791722Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:13.6792383Z 2025-05-07T20:10:13.6793564Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:13.6795176Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:13.6795726Z ^ 2025-05-07T20:10:13.6796040Z detected during: 2025-05-07T20:10:13.6815333Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:13.6854976Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:13.6894046Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:13.6933491Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:13.6956021Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu 2025-05-07T20:10:13.6957963Z 2025-05-07T20:10:38.1027051Z [109/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu.o 2025-05-07T20:10:38.1038363Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:10:38.1039943Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1041037Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1041438Z ^ 2025-05-07T20:10:38.1041596Z 2025-05-07T20:10:38.1041825Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1042167Z 2025-05-07T20:10:38.1042958Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1044063Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:10:38.1044455Z ^ 2025-05-07T20:10:38.1044616Z 2025-05-07T20:10:38.1045497Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1046579Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1046969Z ^ 2025-05-07T20:10:38.1047208Z detected during: 2025-05-07T20:10:38.1060230Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:38.1085849Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:38.1116396Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:38.1132868Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T20:10:38.1134048Z 2025-05-07T20:10:38.1134275Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1134604Z 2025-05-07T20:10:38.1135389Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1136577Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1136979Z ^ 2025-05-07T20:10:38.1137224Z detected during: 2025-05-07T20:10:38.1150186Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:38.1175055Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:38.1201016Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:38.1215378Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T20:10:38.1216450Z 2025-05-07T20:10:38.1216676Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1217000Z 2025-05-07T20:10:38.1217780Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1218864Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1219271Z ^ 2025-05-07T20:10:38.1219510Z detected during: 2025-05-07T20:10:38.1234411Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:38.1268574Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:38.1308664Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:38.1323159Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T20:10:38.1324242Z 2025-05-07T20:10:38.1324477Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1324807Z 2025-05-07T20:10:38.1325657Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1326752Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1327154Z ^ 2025-05-07T20:10:38.1327390Z detected during: 2025-05-07T20:10:38.1340443Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:38.1365394Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:38.1391227Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:38.1405763Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T20:10:38.1406885Z 2025-05-07T20:10:38.1407113Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1407445Z 2025-05-07T20:10:38.1408229Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1409374Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1409779Z ^ 2025-05-07T20:10:38.1410025Z detected during: 2025-05-07T20:10:38.1423083Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:38.1447856Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:38.1473151Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:38.1488036Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T20:10:38.1489124Z 2025-05-07T20:10:38.1489348Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1489673Z 2025-05-07T20:10:38.1490466Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:38.1491563Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:38.1491968Z ^ 2025-05-07T20:10:38.1492205Z detected during: 2025-05-07T20:10:38.1505397Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:38.1530300Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:38.1555620Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<192>, cute::C<128>>, cute::tuple, cute::Layout>>, cute::SM100_TMEM_LOAD_16dp256b8x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:38.1570011Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=192, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu 2025-05-07T20:10:38.1571094Z 2025-05-07T20:10:38.1571320Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:38.1571653Z 2025-05-07T20:10:48.0572445Z [110/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu.o 2025-05-07T20:10:48.0590695Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:10:48.0593144Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.0594828Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.0595442Z ^ 2025-05-07T20:10:48.0595676Z 2025-05-07T20:10:48.0596009Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.0596517Z 2025-05-07T20:10:48.0597784Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.0599578Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:10:48.0600169Z ^ 2025-05-07T20:10:48.0600401Z 2025-05-07T20:10:48.0601915Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.0603607Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.0604195Z ^ 2025-05-07T20:10:48.0604551Z detected during: 2025-05-07T20:10:48.0625367Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.0664713Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.0704952Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.0727816Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.0729571Z 2025-05-07T20:10:48.0729910Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.0730415Z 2025-05-07T20:10:48.0731636Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.0733529Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.0734061Z ^ 2025-05-07T20:10:48.0734361Z detected during: 2025-05-07T20:10:48.0754155Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:48.0794398Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.0833566Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.0873161Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.0897799Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.0899564Z 2025-05-07T20:10:48.0901043Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.0903087Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.0903877Z ^ 2025-05-07T20:10:48.0904314Z detected during: 2025-05-07T20:10:48.0925210Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.0963762Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1013965Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1036822Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1038663Z 2025-05-07T20:10:48.1039026Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.1039632Z 2025-05-07T20:10:48.1040864Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1042542Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1043096Z ^ 2025-05-07T20:10:48.1043402Z detected during: 2025-05-07T20:10:48.1062989Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:48.1103390Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1142710Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1182977Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1206171Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1207887Z 2025-05-07T20:10:48.1209127Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1210825Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1211450Z ^ 2025-05-07T20:10:48.1211815Z detected during: 2025-05-07T20:10:48.1232674Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1271636Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1302522Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1316900Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1317967Z 2025-05-07T20:10:48.1318200Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.1318530Z 2025-05-07T20:10:48.1319315Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1320388Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1320750Z ^ 2025-05-07T20:10:48.1320970Z detected during: 2025-05-07T20:10:48.1333130Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:48.1357976Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1382091Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1407586Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1421721Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1422798Z 2025-05-07T20:10:48.1423574Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1424667Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1425069Z ^ 2025-05-07T20:10:48.1425313Z detected during: 2025-05-07T20:10:48.1438198Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1462408Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1487740Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1501889Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1502967Z 2025-05-07T20:10:48.1503191Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.1503514Z 2025-05-07T20:10:48.1504334Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1505399Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1505767Z ^ 2025-05-07T20:10:48.1505971Z detected during: 2025-05-07T20:10:48.1518269Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:48.1542975Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1567224Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1592612Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1606781Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1607851Z 2025-05-07T20:10:48.1608634Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1609722Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1610124Z ^ 2025-05-07T20:10:48.1610362Z detected during: 2025-05-07T20:10:48.1623091Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1647375Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1672113Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1686822Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1687895Z 2025-05-07T20:10:48.1688124Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.1688447Z 2025-05-07T20:10:48.1689293Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1690358Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1690717Z ^ 2025-05-07T20:10:48.1690924Z detected during: 2025-05-07T20:10:48.1703059Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:48.1727916Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1752236Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1776831Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1791681Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1792949Z 2025-05-07T20:10:48.1793723Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1794877Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1795279Z ^ 2025-05-07T20:10:48.1795513Z detected during: 2025-05-07T20:10:48.1808298Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1832591Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1857193Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1871277Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1872478Z 2025-05-07T20:10:48.1872703Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:10:48.1873032Z 2025-05-07T20:10:48.1873807Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:10:48.1874864Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:10:48.1875220Z ^ 2025-05-07T20:10:48.1875430Z detected during: 2025-05-07T20:10:48.1888366Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:10:48.1913248Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:10:48.1937385Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:10:48.1962723Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:10:48.1977349Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu 2025-05-07T20:10:48.1978429Z 2025-05-07T20:12:45.0959234Z [111/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu.o 2025-05-07T20:12:45.0976318Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:12:45.0978622Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.0980288Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.0980910Z ^ 2025-05-07T20:12:45.0981142Z 2025-05-07T20:12:45.0981905Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.0982422Z 2025-05-07T20:12:45.0983664Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.0986417Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:12:45.0987050Z ^ 2025-05-07T20:12:45.0987292Z 2025-05-07T20:12:45.0988519Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.0990256Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.0990877Z ^ 2025-05-07T20:12:45.0991257Z detected during: 2025-05-07T20:12:45.1012156Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1051776Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1091498Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.1113848Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.1115564Z 2025-05-07T20:12:45.1115900Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.1116416Z 2025-05-07T20:12:45.1117659Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.1119382Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.1119959Z ^ 2025-05-07T20:12:45.1120266Z detected during: 2025-05-07T20:12:45.1139998Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:12:45.1179357Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1218933Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1258574Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.1280877Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.1282632Z 2025-05-07T20:12:45.1283791Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.1286358Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.1287002Z ^ 2025-05-07T20:12:45.1287370Z detected during: 2025-05-07T20:12:45.1307497Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1345828Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1384566Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.1407394Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.1409178Z 2025-05-07T20:12:45.1409557Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.1410056Z 2025-05-07T20:12:45.1411240Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.1412905Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.1413968Z ^ 2025-05-07T20:12:45.1414265Z detected during: 2025-05-07T20:12:45.1434084Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:12:45.1474680Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1513932Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1553546Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.1576690Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.1578440Z 2025-05-07T20:12:45.1579770Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.1581447Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.1582085Z ^ 2025-05-07T20:12:45.1582448Z detected during: 2025-05-07T20:12:45.1603108Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1641231Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1681683Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.1704513Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.1706588Z 2025-05-07T20:12:45.1707150Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.1707686Z 2025-05-07T20:12:45.1708945Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.1710618Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.1711186Z ^ 2025-05-07T20:12:45.1711490Z detected during: 2025-05-07T20:12:45.1731383Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:12:45.1770931Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1810712Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1850039Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.1872902Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.1874691Z 2025-05-07T20:12:45.1875899Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.1877609Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.1878253Z ^ 2025-05-07T20:12:45.1878657Z detected during: 2025-05-07T20:12:45.1899625Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.1938339Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.1977702Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.2000797Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.2002576Z 2025-05-07T20:12:45.2002935Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.2003480Z 2025-05-07T20:12:45.2004696Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.2006406Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.2006998Z ^ 2025-05-07T20:12:45.2007318Z detected during: 2025-05-07T20:12:45.2026939Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:12:45.2066853Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.2105979Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.2145140Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.2167945Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.2169738Z 2025-05-07T20:12:45.2170981Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.2172714Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.2173333Z ^ 2025-05-07T20:12:45.2173707Z detected during: 2025-05-07T20:12:45.2194716Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.2233674Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.2272942Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.2295967Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.2297720Z 2025-05-07T20:12:45.2298083Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.2298672Z 2025-05-07T20:12:45.2310275Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.2312484Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.2313101Z ^ 2025-05-07T20:12:45.2313434Z detected during: 2025-05-07T20:12:45.2332806Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:12:45.2372464Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.2412022Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.2451281Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.2473902Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.2475612Z 2025-05-07T20:12:45.2476830Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.2478947Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.2479567Z ^ 2025-05-07T20:12:45.2479932Z detected during: 2025-05-07T20:12:45.2500642Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.2538841Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.2578548Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.2601913Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.2603697Z 2025-05-07T20:12:45.2604049Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:45.2604557Z 2025-05-07T20:12:45.2605784Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:45.2607472Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:45.2608048Z ^ 2025-05-07T20:12:45.2608357Z detected during: 2025-05-07T20:12:45.2627810Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:12:45.2667650Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:12:45.2708635Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:12:45.2748597Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:12:45.2771558Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu 2025-05-07T20:12:45.2773316Z 2025-05-07T20:12:57.3428600Z [112/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16.cu.o 2025-05-07T20:12:57.3445514Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:12:57.3447785Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:57.3449472Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:12:57.3450080Z ^ 2025-05-07T20:12:57.3450330Z 2025-05-07T20:12:57.3450670Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:12:57.3451396Z 2025-05-07T20:12:57.3452638Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:12:57.3454504Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:12:57.3455104Z ^ 2025-05-07T20:12:57.3455340Z 2025-05-07T20:13:13.2484698Z [113/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu.o 2025-05-07T20:13:13.2502070Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:13:13.2504466Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:13.2506175Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:13.2506808Z ^ 2025-05-07T20:13:13.2507068Z 2025-05-07T20:13:13.2507427Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:13.2508173Z 2025-05-07T20:13:13.2509446Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:13.2511339Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:13:13.2512099Z ^ 2025-05-07T20:13:13.2512347Z 2025-05-07T20:13:26.4350777Z [114/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu.o 2025-05-07T20:13:26.4368895Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:13:26.4371142Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4372785Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.4373304Z ^ 2025-05-07T20:13:26.4373508Z 2025-05-07T20:13:26.4373828Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.4374622Z 2025-05-07T20:13:26.4376150Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4378029Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:13:26.4378632Z ^ 2025-05-07T20:13:26.4378878Z 2025-05-07T20:13:26.4380072Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4381771Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.4382394Z ^ 2025-05-07T20:13:26.4382772Z detected during: 2025-05-07T20:13:26.4403568Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.4442088Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.4481547Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.4504229Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.4505974Z 2025-05-07T20:13:26.4506323Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.4506845Z 2025-05-07T20:13:26.4508071Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4509761Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.4510323Z ^ 2025-05-07T20:13:26.4510652Z detected during: 2025-05-07T20:13:26.4530817Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:26.4569949Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.4609251Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.4648907Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.4671291Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.4673191Z 2025-05-07T20:13:26.4674422Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4676161Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.4676942Z ^ 2025-05-07T20:13:26.4677343Z detected during: 2025-05-07T20:13:26.4697726Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.4735869Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.4774914Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.4797479Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.4799214Z 2025-05-07T20:13:26.4799563Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.4800081Z 2025-05-07T20:13:26.4801331Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4803028Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.4803576Z ^ 2025-05-07T20:13:26.4803873Z detected during: 2025-05-07T20:13:26.4823141Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:26.4862520Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.4900126Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.4938790Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.4961045Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.4962698Z 2025-05-07T20:13:26.4964048Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.4965747Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.4966374Z ^ 2025-05-07T20:13:26.4966747Z detected during: 2025-05-07T20:13:26.4987195Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5025537Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5064648Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5087470Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5089458Z 2025-05-07T20:13:26.5089814Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.5090322Z 2025-05-07T20:13:26.5091696Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5093413Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5093997Z ^ 2025-05-07T20:13:26.5094310Z detected during: 2025-05-07T20:13:26.5113769Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:26.5153003Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5190057Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5214776Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5228960Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5230034Z 2025-05-07T20:13:26.5230823Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5232039Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5232441Z ^ 2025-05-07T20:13:26.5232671Z detected during: 2025-05-07T20:13:26.5245566Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5270015Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5295638Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5309965Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5311037Z 2025-05-07T20:13:26.5311267Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.5311591Z 2025-05-07T20:13:26.5312450Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5313564Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5313929Z ^ 2025-05-07T20:13:26.5314128Z detected during: 2025-05-07T20:13:26.5326389Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:26.5351415Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5376043Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5401516Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5415816Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5416883Z 2025-05-07T20:13:26.5417657Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5418748Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5419147Z ^ 2025-05-07T20:13:26.5419377Z detected during: 2025-05-07T20:13:26.5432323Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5456734Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5481746Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5496521Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5497602Z 2025-05-07T20:13:26.5497823Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.5498307Z 2025-05-07T20:13:26.5499082Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5500139Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5500497Z ^ 2025-05-07T20:13:26.5500707Z detected during: 2025-05-07T20:13:26.5513278Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:26.5538290Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5562821Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5588272Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5606389Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5607721Z 2025-05-07T20:13:26.5608675Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5610037Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5610529Z ^ 2025-05-07T20:13:26.5610789Z detected during: 2025-05-07T20:13:26.5627134Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5652171Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5677089Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5692141Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5693230Z 2025-05-07T20:13:26.5693454Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:26.5693780Z 2025-05-07T20:13:26.5694562Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:26.5695623Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:26.5695986Z ^ 2025-05-07T20:13:26.5696194Z detected during: 2025-05-07T20:13:26.5708469Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:26.5733516Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:26.5757972Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:26.5782837Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:26.5797612Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=2, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu 2025-05-07T20:13:26.5798843Z 2025-05-07T20:13:59.2960029Z [115/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu.o 2025-05-07T20:13:59.2971330Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:13:59.2972815Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.2973908Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.2974315Z ^ 2025-05-07T20:13:59.2974471Z 2025-05-07T20:13:59.2974701Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.2975159Z 2025-05-07T20:13:59.2975957Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.2977155Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:13:59.2986047Z ^ 2025-05-07T20:13:59.2986238Z 2025-05-07T20:13:59.2987110Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.2988209Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.2988614Z ^ 2025-05-07T20:13:59.2988851Z detected during: 2025-05-07T20:13:59.3001994Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3026524Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3051455Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3065663Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3066743Z 2025-05-07T20:13:59.3066980Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.3067307Z 2025-05-07T20:13:59.3068085Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3069150Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3069520Z ^ 2025-05-07T20:13:59.3069729Z detected during: 2025-05-07T20:13:59.3082186Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:59.3107552Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3132011Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3157379Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3171587Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3172667Z 2025-05-07T20:13:59.3173457Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3174638Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3175045Z ^ 2025-05-07T20:13:59.3175286Z detected during: 2025-05-07T20:13:59.3188628Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3213109Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3237937Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3252190Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3253269Z 2025-05-07T20:13:59.3253496Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.3253828Z 2025-05-07T20:13:59.3254653Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3255720Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3256093Z ^ 2025-05-07T20:13:59.3256345Z detected during: 2025-05-07T20:13:59.3268565Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:59.3298481Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3323025Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3347726Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3361976Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3363048Z 2025-05-07T20:13:59.3363862Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3364951Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3365356Z ^ 2025-05-07T20:13:59.3365587Z detected during: 2025-05-07T20:13:59.3378353Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3403510Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3428321Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3442521Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3443638Z 2025-05-07T20:13:59.3443916Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.3444283Z 2025-05-07T20:13:59.3445059Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3446118Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3446474Z ^ 2025-05-07T20:13:59.3446681Z detected during: 2025-05-07T20:13:59.3458908Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:59.3483819Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3508699Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3533718Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3547853Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3548931Z 2025-05-07T20:13:59.3549705Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3550792Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3551189Z ^ 2025-05-07T20:13:59.3551431Z detected during: 2025-05-07T20:13:59.3564331Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3593060Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3617915Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3632296Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3633379Z 2025-05-07T20:13:59.3633605Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.3633935Z 2025-05-07T20:13:59.3634718Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3635776Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3636144Z ^ 2025-05-07T20:13:59.3636354Z detected during: 2025-05-07T20:13:59.3648577Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:59.3673639Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3698742Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3723581Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3737750Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3738830Z 2025-05-07T20:13:59.3739608Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3740698Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3741098Z ^ 2025-05-07T20:13:59.3741340Z detected during: 2025-05-07T20:13:59.3754183Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3778592Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3804259Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3818494Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3819575Z 2025-05-07T20:13:59.3819799Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.3820121Z 2025-05-07T20:13:59.3820902Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3821991Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3822355Z ^ 2025-05-07T20:13:59.3822556Z detected during: 2025-05-07T20:13:59.3834905Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:59.3859767Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3884237Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3909568Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.3923825Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.3924943Z 2025-05-07T20:13:59.3925725Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.3926843Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.3927247Z ^ 2025-05-07T20:13:59.3927478Z detected during: 2025-05-07T20:13:59.3940275Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.3964694Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.3990138Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.4004508Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.4005582Z 2025-05-07T20:13:59.4005810Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:13:59.4006131Z 2025-05-07T20:13:59.4006906Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:13:59.4007963Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:13:59.4008318Z ^ 2025-05-07T20:13:59.4008528Z detected during: 2025-05-07T20:13:59.4020795Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:13:59.4048970Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:13:59.4073466Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:13:59.4099009Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:13:59.4113681Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu 2025-05-07T20:13:59.4114766Z 2025-05-07T20:14:00.4095388Z [116/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu.o 2025-05-07T20:14:00.4112536Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:14:25.2679388Z [117/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_lite.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_lite.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_lite.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_lite.cu.o 2025-05-07T20:14:25.2696722Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:14:26.9540085Z [118/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu.o 2025-05-07T20:14:26.9557548Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:14:51.0720875Z [119/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu.o 2025-05-07T20:14:51.0738285Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:14:51.0740572Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.0742338Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.0742954Z ^ 2025-05-07T20:14:51.0743200Z 2025-05-07T20:14:51.0743534Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.0744052Z 2025-05-07T20:14:51.0745285Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.0747034Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:14:51.0747665Z ^ 2025-05-07T20:14:51.0747898Z 2025-05-07T20:14:51.0749288Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.0751027Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.0751719Z ^ 2025-05-07T20:14:51.0752271Z detected during: 2025-05-07T20:14:51.0772670Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.0811403Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.0850807Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.0873641Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.0875357Z 2025-05-07T20:14:51.0875719Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.0876245Z 2025-05-07T20:14:51.0877443Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.0879308Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.0879890Z ^ 2025-05-07T20:14:51.0880238Z detected during: 2025-05-07T20:14:51.0900199Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:14:51.0940206Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.0979158Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1019781Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1042654Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1044381Z 2025-05-07T20:14:51.1045582Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1047272Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1047888Z ^ 2025-05-07T20:14:51.1048245Z detected during: 2025-05-07T20:14:51.1068325Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.1118350Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1157466Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1179939Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1181820Z 2025-05-07T20:14:51.1182177Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.1182690Z 2025-05-07T20:14:51.1183908Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1186086Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1186644Z ^ 2025-05-07T20:14:51.1186955Z detected during: 2025-05-07T20:14:51.1206490Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:14:51.1246083Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.1294527Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1333648Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1356161Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1357898Z 2025-05-07T20:14:51.1359154Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1361041Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1361682Z ^ 2025-05-07T20:14:51.1362044Z detected during: 2025-05-07T20:14:51.1381980Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.1420821Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1459891Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1482186Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1483890Z 2025-05-07T20:14:51.1484262Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.1484773Z 2025-05-07T20:14:51.1486415Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1489865Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1490447Z ^ 2025-05-07T20:14:51.1490771Z detected during: 2025-05-07T20:14:51.1510120Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:14:51.1548783Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.1587158Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1626113Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1648761Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1650401Z 2025-05-07T20:14:51.1651552Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1653227Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1653800Z ^ 2025-05-07T20:14:51.1654135Z detected during: 2025-05-07T20:14:51.1674118Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.1712708Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1751352Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1773918Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1775641Z 2025-05-07T20:14:51.1775993Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.1776497Z 2025-05-07T20:14:51.1777727Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1779392Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1779965Z ^ 2025-05-07T20:14:51.1780273Z detected during: 2025-05-07T20:14:51.1800419Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:14:51.1839872Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.1878435Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.1918941Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.1941228Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.1943056Z 2025-05-07T20:14:51.1944270Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.1946005Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.1946633Z ^ 2025-05-07T20:14:51.1946990Z detected during: 2025-05-07T20:14:51.1967062Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.2005957Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.2045209Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.2067105Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.2068826Z 2025-05-07T20:14:51.2069193Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.2069693Z 2025-05-07T20:14:51.2070918Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.2072798Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.2073510Z ^ 2025-05-07T20:14:51.2073824Z detected during: 2025-05-07T20:14:51.2093392Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:14:51.2132426Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.2170705Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.2209761Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.2231472Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.2233329Z 2025-05-07T20:14:51.2234541Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.2236259Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.2236886Z ^ 2025-05-07T20:14:51.2237253Z detected during: 2025-05-07T20:14:51.2257234Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.2295336Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.2333721Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.2356244Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.2358192Z 2025-05-07T20:14:51.2358555Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:14:51.2359067Z 2025-05-07T20:14:51.2360312Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:14:51.2361995Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:14:51.2362547Z ^ 2025-05-07T20:14:51.2362874Z detected during: 2025-05-07T20:14:51.2382083Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<4>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:14:51.2421612Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:14:51.2459705Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:14:51.2501323Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<4>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:14:51.2523719Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=2, TBS_N=4, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu 2025-05-07T20:14:51.2525416Z 2025-05-07T20:16:20.4972511Z [120/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu.o 2025-05-07T20:16:20.4991536Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:16:20.4993998Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:20.4995727Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:20.4996330Z ^ 2025-05-07T20:16:20.4996565Z 2025-05-07T20:16:20.4996890Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:20.4997390Z 2025-05-07T20:16:20.4998662Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:20.5000844Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:16:20.5001439Z ^ 2025-05-07T20:16:20.5001638Z 2025-05-07T20:16:27.5719085Z [121/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu.o 2025-05-07T20:16:27.5736932Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:16:27.5739286Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.5740995Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.5741613Z ^ 2025-05-07T20:16:27.5741863Z 2025-05-07T20:16:27.5742196Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.5742698Z 2025-05-07T20:16:27.5743941Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.5745918Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:16:27.5746557Z ^ 2025-05-07T20:16:27.5746796Z 2025-05-07T20:16:27.5748087Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.5749822Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.5750474Z ^ 2025-05-07T20:16:27.5750811Z detected during: 2025-05-07T20:16:27.5771365Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.5810319Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.5849569Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.5872624Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.5874351Z 2025-05-07T20:16:27.5874713Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.5875360Z 2025-05-07T20:16:27.5876590Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.5878264Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.5878809Z ^ 2025-05-07T20:16:27.5879124Z detected during: 2025-05-07T20:16:27.5898468Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:27.5938043Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.5976949Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6016976Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6039511Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6041221Z 2025-05-07T20:16:27.6042403Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6044077Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6044727Z ^ 2025-05-07T20:16:27.6045102Z detected during: 2025-05-07T20:16:27.6065330Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6104245Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6143647Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6166365Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6168131Z 2025-05-07T20:16:27.6168501Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.6169041Z 2025-05-07T20:16:27.6170311Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6172023Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6172632Z ^ 2025-05-07T20:16:27.6172958Z detected during: 2025-05-07T20:16:27.6193719Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:27.6232992Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6270989Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6311164Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6333247Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6335078Z 2025-05-07T20:16:27.6336331Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6338070Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6338707Z ^ 2025-05-07T20:16:27.6339077Z detected during: 2025-05-07T20:16:27.6359159Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6397321Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6436168Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6458499Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6460251Z 2025-05-07T20:16:27.6460626Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.6461144Z 2025-05-07T20:16:27.6462561Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6464262Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6464816Z ^ 2025-05-07T20:16:27.6465234Z detected during: 2025-05-07T20:16:27.6485369Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:27.6524523Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6562811Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6602339Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6624435Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6626048Z 2025-05-07T20:16:27.6627137Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6628790Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6629412Z ^ 2025-05-07T20:16:27.6629935Z detected during: 2025-05-07T20:16:27.6650069Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6689148Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6728290Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6750997Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6752994Z 2025-05-07T20:16:27.6753360Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.6753894Z 2025-05-07T20:16:27.6755184Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6756903Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6757460Z ^ 2025-05-07T20:16:27.6757775Z detected during: 2025-05-07T20:16:27.6777185Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:27.6831213Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6869551Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.6908401Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.6930831Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.6932742Z 2025-05-07T20:16:27.6933972Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.6935741Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.6936352Z ^ 2025-05-07T20:16:27.6936703Z detected during: 2025-05-07T20:16:27.6956676Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.6995711Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.7034744Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.7057400Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.7059110Z 2025-05-07T20:16:27.7059475Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.7060000Z 2025-05-07T20:16:27.7061159Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.7062844Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.7063581Z ^ 2025-05-07T20:16:27.7063913Z detected during: 2025-05-07T20:16:27.7083412Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:27.7123613Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.7161650Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.7201582Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.7223570Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.7225348Z 2025-05-07T20:16:27.7226516Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.7228250Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.7228875Z ^ 2025-05-07T20:16:27.7229244Z detected during: 2025-05-07T20:16:27.7249426Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.7287824Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.7326315Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.7348266Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.7350079Z 2025-05-07T20:16:27.7350422Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:27.7350927Z 2025-05-07T20:16:27.7352268Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:27.7353927Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:27.7354531Z ^ 2025-05-07T20:16:27.7354842Z detected during: 2025-05-07T20:16:27.7373874Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:27.7413602Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:27.7452690Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:27.7491351Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:27.7513917Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=false]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu 2025-05-07T20:16:27.7515649Z 2025-05-07T20:16:33.1073419Z [122/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu.o 2025-05-07T20:16:33.1094386Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:16:33.1096698Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1098429Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1099043Z ^ 2025-05-07T20:16:33.1099301Z 2025-05-07T20:16:33.1099689Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.1100244Z 2025-05-07T20:16:33.1101593Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1103329Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:16:33.1103940Z ^ 2025-05-07T20:16:33.1104190Z 2025-05-07T20:16:33.1105596Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1107433Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1108057Z ^ 2025-05-07T20:16:33.1108425Z detected during: 2025-05-07T20:16:33.1128711Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.1167141Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.1207052Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.1229521Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.1231280Z 2025-05-07T20:16:33.1231643Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.1232374Z 2025-05-07T20:16:33.1233630Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1235472Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1236043Z ^ 2025-05-07T20:16:33.1236358Z detected during: 2025-05-07T20:16:33.1255838Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:33.1295610Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.1334081Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.1373131Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.1396543Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.1398367Z 2025-05-07T20:16:33.1399589Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1401285Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1401892Z ^ 2025-05-07T20:16:33.1402259Z detected during: 2025-05-07T20:16:33.1422452Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.1461311Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.1500492Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.1523068Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.1525126Z 2025-05-07T20:16:33.1525487Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.1526032Z 2025-05-07T20:16:33.1527296Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1528958Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1529518Z ^ 2025-05-07T20:16:33.1529825Z detected during: 2025-05-07T20:16:33.1548997Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:33.1588921Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.1627236Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.1667039Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.1689770Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.1691442Z 2025-05-07T20:16:33.1692644Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1694609Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1695224Z ^ 2025-05-07T20:16:33.1695602Z detected during: 2025-05-07T20:16:33.1715896Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.1754322Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.1793600Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.1816186Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.1817896Z 2025-05-07T20:16:33.1818271Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.1818782Z 2025-05-07T20:16:33.1820000Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1821688Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1822250Z ^ 2025-05-07T20:16:33.1822726Z detected during: 2025-05-07T20:16:33.1842111Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:33.1881741Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.1921205Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.1960506Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.1982765Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.1984512Z 2025-05-07T20:16:33.1986280Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.1987980Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.1988625Z ^ 2025-05-07T20:16:33.1988985Z detected during: 2025-05-07T20:16:33.2009455Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.2047724Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.2087108Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.2110986Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.2112910Z 2025-05-07T20:16:33.2113260Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.2113899Z 2025-05-07T20:16:33.2115132Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.2116836Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.2117410Z ^ 2025-05-07T20:16:33.2117731Z detected during: 2025-05-07T20:16:33.2136913Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:33.2176409Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.2215399Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.2254828Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.2277736Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.2279739Z 2025-05-07T20:16:33.2280989Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.2282721Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.2283357Z ^ 2025-05-07T20:16:33.2283721Z detected during: 2025-05-07T20:16:33.2304635Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.2343672Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.2383444Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.2406745Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.2408494Z 2025-05-07T20:16:33.2408842Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.2409375Z 2025-05-07T20:16:33.2410592Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.2412314Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.2412871Z ^ 2025-05-07T20:16:33.2413187Z detected during: 2025-05-07T20:16:33.2433048Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:33.2473361Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.2513919Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.2553361Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.2575835Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.2577569Z 2025-05-07T20:16:33.2578809Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.2580527Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.2581073Z ^ 2025-05-07T20:16:33.2581409Z detected during: 2025-05-07T20:16:33.2602614Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.2640149Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.2679568Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.2702657Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.2704398Z 2025-05-07T20:16:33.2704987Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:16:33.2705505Z 2025-05-07T20:16:33.2706710Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(729): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:16:33.2708347Z return observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:16:33.2708897Z ^ 2025-05-07T20:16:33.2709188Z detected during: 2025-05-07T20:16:33.2728625Z instantiation of "auto cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::load_init(const ProblemShape_MNKL &, cutlass::gemm::collective::CollectiveMma, TileShape_, ElementPairA_, StridePairA_, ElementPairB_, StridePairB_, TiledMma_, GmemTiledCopyPairA_, SmemLayoutAtomPairA_, SmemCopyAtomA_, TransformA_, GmemTiledCopyPairB_, SmemLayoutAtomPairB_, SmemCopyAtomB_, TransformB_>::TensorStorage &) const [with Stages=6, SchedulerPipelineStageCount=3, AccumulatorPipelineStageCount=1, ClusterShape=cute::tuple, cute::C<1>, cute::C<1>>, TileShape_=cute::tuple, cute::C<256>, cute::C<128>>, ElementPairA_=cute::tuple, StridePairA_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, ElementPairB_=cute::tuple, StridePairB_=cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, TiledMma_=cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, GmemTiledCopyPairA_=cute::tuple, SmemLayoutAtomPairA_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, SmemCopyAtomA_=void, TransformA_=cute::identity, GmemTiledCopyPairB_=cute::tuple, SmemLayoutAtomPairB_=cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, SmemCopyAtomB_=void, TransformB_=cute::identity, ProblemShape_MNKL=cute::tuple]" at line 595 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/kernel/sm100_gemm_tma_warpspecialized.hpp 2025-05-07T20:16:33.2768423Z instantiation of "void cutlass::gemm::kernel::GemmUniversal>::operator()(const cutlass::gemm::kernel::GemmUniversal>::Params &, char *) [with ProblemShape_=cute::tuple, CollectiveMainloop_=cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, CollectiveEpilogue_=cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, TileSchedulerTag_=void]" at line 122 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/device_kernel.h 2025-05-07T20:16:33.2807618Z instantiation of "void cutlass::device_kernel(Operator::Params) [with Operator=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 340 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h 2025-05-07T20:16:33.2845996Z instantiation of "cutlass::Status cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::initialize(const cutlass::gemm::device::GemmUniversalAdapter, void>::value, void>>::Arguments &, void *, cudaStream_t, cutlass::CudaHostAdapter *) [with GemmKernel_=cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1>>>, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::tuple, cute::tuple, int64_t>, cute::Layout, int32_t>, cute::tuple, int32_t>, cute::tuple>, cute::tuple, int32_t>, cute::tuple, cute::C<1>>, cute::_512>, cute::tuple, int32_t>>>>, cute::TiledMMA>, cute::Layout, cute::tuple, cute::C<0>, cute::C<0>>>, cute::tuple>, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<1>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<512>>>>>, void, cute::identity, cute::tuple, cute::tuple, cute::smem_ptr_flag_bits<4>, cute::Layout, cute::tuple>>, cute::Layout, cute::C<2>>, cute::tuple>, cute::_1, cute::tuple>, cute::tuple, cute::C<512>>, cute::tuple, cute::C<1>>>, cute::_0, cute::tuple, cute::C<1024>>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::bfloat16_t, cute::tuple, int64_t, int64_t>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<256>, cute::C<128>>, cute::tuple>, cute::SM100_TMEM_LOAD_16dp256b16x, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8>>, cute::tuple>>>, cute::SM90_U16x8_STSM_T, cute::AutoVectorizingCopyWithAssumedAlignment<128>>, void, void>]" at line 220 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_common.cuh 2025-05-07T20:16:33.2867944Z instantiation of "at::Tensor _f4f4bf16(at::Tensor, at::Tensor, at::Tensor, at::Tensor, at::Tensor) [with TB_M=256, TB_N=256, TBS_M=4, TBS_N=1, TBS_K=1, USE_MX=true]" at line 22 of /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu 2025-05-07T20:16:33.2869701Z 2025-05-07T20:17:10.4124315Z [123/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu.o 2025-05-07T20:17:10.4142440Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:17:10.4144698Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:10.4146356Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:17:10.4146954Z ^ 2025-05-07T20:17:10.4147204Z 2025-05-07T20:17:10.4147549Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:17:10.4148061Z 2025-05-07T20:17:10.4149311Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:10.4151027Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:17:10.4151624Z ^ 2025-05-07T20:17:10.4152053Z 2025-05-07T20:17:18.2929574Z [124/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu.o 2025-05-07T20:17:18.2948470Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:17:18.2950814Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:18.2952763Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:17:18.2953424Z ^ 2025-05-07T20:17:18.2953690Z 2025-05-07T20:17:18.2954057Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:17:18.2954563Z 2025-05-07T20:17:18.2955753Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:18.2957570Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:17:18.2958220Z ^ 2025-05-07T20:17:18.2958461Z 2025-05-07T20:17:26.4864957Z [125/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu.o 2025-05-07T20:17:26.4883040Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:17:26.4885802Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:26.4895974Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:17:26.4896654Z ^ 2025-05-07T20:17:26.4896908Z 2025-05-07T20:17:26.4897260Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:17:26.4897774Z 2025-05-07T20:17:26.4899013Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:26.4900725Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:17:26.4901342Z ^ 2025-05-07T20:17:26.4901585Z 2025-05-07T20:17:26.4910657Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi2EEENSA_ILi1EEESC_EEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEESJ_SK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E4M3_SS_TNILNSO_7ScaleInE1ELSQ_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESV_SV_EEEEENS5_IJNS4_10UnderscoreESY_SY_EEEEENS4_13SM90_TMA_LOADENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENST_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityENS4_23SM90_TMA_LOAD_MULTICASTES1B_vS1C_EENS_8epilogue10collective18CollectiveEpilogueINS1F_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1M_NS1F_6fusion15Sm90TreeVisitorINS1O_11Sm90ComputeINS_10multipliesES1N_fLNS_15FloatRoundStyleE2EvEEJNS1O_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SV_SV_EEELi4ELb1EEENS1P_INS1Q_IS1R_ffLS1S_2EvEEJNS1O_16Sm90RowBroadcastILi0ESI_ffNS5_IJSV_SC_SV_EEELi4ELb1EEENS1O_12Sm90AccFetchEEEEEEES11_NS12_IS14_NS15_ILi16EEENST_INS5_IJNSA_ILi64EEES17_EEENS5_IJSC_S25_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES29_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:26.4929184Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi2EEENSA_ILi1EEESC_EEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEENS_12float_e5m2_tESK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E5M2_SS_TNILNSP_7ScaleInE1ELSR_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESW_SW_EEEEENS5_IJNS4_10UnderscoreESZ_SZ_EEEEENS4_13SM90_TMA_LOADENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENSU_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityENS4_23SM90_TMA_LOAD_MULTICASTES1C_vS1D_EENS_8epilogue10collective18CollectiveEpilogueINS1G_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1N_NS1G_6fusion15Sm90TreeVisitorINS1P_11Sm90ComputeINS_10multipliesES1O_fLNS_15FloatRoundStyleE2EvEEJNS1P_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SW_SW_EEELi4ELb1EEENS1Q_INS1R_IS1S_ffLS1T_2EvEEJNS1P_16Sm90RowBroadcastILi0ESI_ffNS5_IJSW_SC_SW_EEELi4ELb1EEENS1P_12Sm90AccFetchEEEEEEES12_NS13_IS15_NS16_ILi16EEENSU_INS5_IJNSA_ILi64EEES18_EEENS5_IJSC_S26_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2A_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:26.4947968Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi2EEENSA_ILi1EEESC_EEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEESJ_SK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E4M3_SS_TNILNSO_7ScaleInE1ELSQ_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESV_SV_EEEEENS5_IJNS4_10UnderscoreESY_SY_EEEEENS4_13SM90_TMA_LOADENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENST_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityENS4_23SM90_TMA_LOAD_MULTICASTES1B_vS1C_EENS_8epilogue10collective18CollectiveEpilogueINS1F_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1M_NS1F_6fusion15Sm90TreeVisitorINS1O_11Sm90ComputeINS_4plusES1N_fLNS_15FloatRoundStyleE2EvEEJNS1O_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SV_SV_EEELi4ELb1EEENS1P_INS1Q_INS_10multipliesEffLS1S_2EvEEJS1W_NS1P_IS1Y_JNS1O_16Sm90RowBroadcastILi0ESI_ffNS5_IJSV_SC_SV_EEELi4ELb1EEENS1O_12Sm90AccFetchEEEEEEEEEES11_NS12_IS14_NS15_ILi16EEENST_INS5_IJNSA_ILi64EEES17_EEENS5_IJSC_S27_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2B_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:26.4966346Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi2EEENSA_ILi1EEESC_EEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEENS_12float_e5m2_tESK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E5M2_SS_TNILNSP_7ScaleInE1ELSR_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESW_SW_EEEEENS5_IJNS4_10UnderscoreESZ_SZ_EEEEENS4_13SM90_TMA_LOADENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENSU_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityENS4_23SM90_TMA_LOAD_MULTICASTES1C_vS1D_EENS_8epilogue10collective18CollectiveEpilogueINS1G_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1N_NS1G_6fusion15Sm90TreeVisitorINS1P_11Sm90ComputeINS_4plusES1O_fLNS_15FloatRoundStyleE2EvEEJNS1P_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SW_SW_EEELi4ELb1EEENS1Q_INS1R_INS_10multipliesEffLS1T_2EvEEJS1X_NS1Q_IS1Z_JNS1P_16Sm90RowBroadcastILi0ESI_ffNS5_IJSW_SC_SW_EEELi4ELb1EEENS1P_12Sm90AccFetchEEEEEEEEEES12_NS13_IS15_NS16_ILi16EEENSU_INS5_IJNSA_ILi64EEES18_EEENS5_IJSC_S28_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2C_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:26.4985433Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi2EEENSA_ILi1EEESC_EEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEESJ_SK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E4M3_SS_TNILNSO_7ScaleInE1ELSQ_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESV_SV_EEEEENS5_IJNS4_10UnderscoreESY_SY_EEEEENS4_13SM90_TMA_LOADENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENST_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityENS4_23SM90_TMA_LOAD_MULTICASTES1B_vS1C_EENS_8epilogue10collective18CollectiveEpilogueINS1F_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1M_NS1F_6fusion15Sm90TreeVisitorINS1O_11Sm90ComputeINS_4plusES1N_S1N_LNS_15FloatRoundStyleE2EvEEJNS1O_16Sm90ColBroadcastILi0ESI_S1N_S1N_NS5_IJSC_SV_SV_EEELi8ELb1EEENS1P_INS1Q_INS_10multipliesES1N_fLS1S_2EvEEJNS1U_ILi0ESI_ffS1V_Li4ELb1EEENS1P_INS1Q_IS1X_ffLS1S_2EvEEJNS1O_16Sm90RowBroadcastILi0ESI_ffNS5_IJSV_SC_SV_EEELi4ELb1EEENS1O_12Sm90AccFetchEEEEEEEEEES11_NS12_IS14_NS15_ILi16EEENST_INS5_IJNSA_ILi64EEES17_EEENS5_IJSC_S29_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2D_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:26.5005016Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi2EEENSA_ILi1EEESC_EEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEENS_12float_e5m2_tESK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E5M2_SS_TNILNSP_7ScaleInE1ELSR_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESW_SW_EEEEENS5_IJNS4_10UnderscoreESZ_SZ_EEEEENS4_13SM90_TMA_LOADENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENSU_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityENS4_23SM90_TMA_LOAD_MULTICASTES1C_vS1D_EENS_8epilogue10collective18CollectiveEpilogueINS1G_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1N_NS1G_6fusion15Sm90TreeVisitorINS1P_11Sm90ComputeINS_4plusES1O_S1O_LNS_15FloatRoundStyleE2EvEEJNS1P_16Sm90ColBroadcastILi0ESI_S1O_S1O_NS5_IJSC_SW_SW_EEELi8ELb1EEENS1Q_INS1R_INS_10multipliesES1O_fLS1T_2EvEEJNS1V_ILi0ESI_ffS1W_Li4ELb1EEENS1Q_INS1R_IS1Y_ffLS1T_2EvEEJNS1P_16Sm90RowBroadcastILi0ESI_ffNS5_IJSW_SC_SW_EEELi4ELb1EEENS1P_12Sm90AccFetchEEEEEEEEEES12_NS13_IS15_NS16_ILi16EEENSU_INS5_IJNSA_ILi64EEES18_EEENS5_IJSC_S2A_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2E_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:27.3232022Z [126/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu.o 2025-05-07T20:17:27.3249590Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:17:27.3251967Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:27.3253809Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:17:27.3254423Z ^ 2025-05-07T20:17:27.3254642Z 2025-05-07T20:17:27.3254956Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:17:27.3255434Z 2025-05-07T20:17:27.3256570Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:27.3258360Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:17:27.3259251Z ^ 2025-05-07T20:17:27.3259510Z 2025-05-07T20:17:27.3268115Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi4EEESB_NSA_ILi1EEEEEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEESJ_SK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E4M3_SS_TNILNSO_7ScaleInE1ELSQ_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESV_SV_EEEEENS5_IJNS4_10UnderscoreESY_SY_EEEEENS4_23SM90_TMA_LOAD_MULTICASTENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENST_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityES11_S1B_vS1C_EENS_8epilogue10collective18CollectiveEpilogueINS1E_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1L_NS1E_6fusion15Sm90TreeVisitorINS1N_11Sm90ComputeINS_10multipliesES1M_fLNS_15FloatRoundStyleE2EvEEJNS1N_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SV_SV_EEELi4ELb1EEENS1O_INS1P_IS1Q_ffLS1R_2EvEEJNS1N_16Sm90RowBroadcastILi0ESI_ffNS5_IJSV_SC_SV_EEELi4ELb1EEENS1N_12Sm90AccFetchEEEEEEENS4_13SM90_TMA_LOADENS12_IS14_NS15_ILi16EEENST_INS5_IJNSA_ILi64EEES17_EEENS5_IJSC_S25_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES29_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:27.3287731Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi4EEESB_NSA_ILi1EEEEEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEENS_12float_e5m2_tESK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E5M2_SS_TNILNSP_7ScaleInE1ELSR_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESW_SW_EEEEENS5_IJNS4_10UnderscoreESZ_SZ_EEEEENS4_23SM90_TMA_LOAD_MULTICASTENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENSU_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityES12_S1C_vS1D_EENS_8epilogue10collective18CollectiveEpilogueINS1F_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1M_NS1F_6fusion15Sm90TreeVisitorINS1O_11Sm90ComputeINS_10multipliesES1N_fLNS_15FloatRoundStyleE2EvEEJNS1O_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SW_SW_EEELi4ELb1EEENS1P_INS1Q_IS1R_ffLS1S_2EvEEJNS1O_16Sm90RowBroadcastILi0ESI_ffNS5_IJSW_SC_SW_EEELi4ELb1EEENS1O_12Sm90AccFetchEEEEEEENS4_13SM90_TMA_LOADENS13_IS15_NS16_ILi16EEENSU_INS5_IJNSA_ILi64EEES18_EEENS5_IJSC_S26_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2A_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:27.3306947Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi4EEESB_NSA_ILi1EEEEEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEESJ_SK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E4M3_SS_TNILNSO_7ScaleInE1ELSQ_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESV_SV_EEEEENS5_IJNS4_10UnderscoreESY_SY_EEEEENS4_23SM90_TMA_LOAD_MULTICASTENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENST_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityES11_S1B_vS1C_EENS_8epilogue10collective18CollectiveEpilogueINS1E_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1L_NS1E_6fusion15Sm90TreeVisitorINS1N_11Sm90ComputeINS_4plusES1M_fLNS_15FloatRoundStyleE2EvEEJNS1N_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SV_SV_EEELi4ELb1EEENS1O_INS1P_INS_10multipliesEffLS1R_2EvEEJS1V_NS1O_IS1X_JNS1N_16Sm90RowBroadcastILi0ESI_ffNS5_IJSV_SC_SV_EEELi4ELb1EEENS1N_12Sm90AccFetchEEEEEEEEEENS4_13SM90_TMA_LOADENS12_IS14_NS15_ILi16EEENST_INS5_IJNSA_ILi64EEES17_EEENS5_IJSC_S27_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2B_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:27.3325754Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi4EEESB_NSA_ILi1EEEEEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEENS_12float_e5m2_tESK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E5M2_SS_TNILNSP_7ScaleInE1ELSR_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESW_SW_EEEEENS5_IJNS4_10UnderscoreESZ_SZ_EEEEENS4_23SM90_TMA_LOAD_MULTICASTENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENSU_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityES12_S1C_vS1D_EENS_8epilogue10collective18CollectiveEpilogueINS1F_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1M_NS1F_6fusion15Sm90TreeVisitorINS1O_11Sm90ComputeINS_4plusES1N_fLNS_15FloatRoundStyleE2EvEEJNS1O_16Sm90ColBroadcastILi0ESI_ffNS5_IJSC_SW_SW_EEELi4ELb1EEENS1P_INS1Q_INS_10multipliesEffLS1S_2EvEEJS1W_NS1P_IS1Y_JNS1O_16Sm90RowBroadcastILi0ESI_ffNS5_IJSW_SC_SW_EEELi4ELb1EEENS1O_12Sm90AccFetchEEEEEEEEEENS4_13SM90_TMA_LOADENS13_IS15_NS16_ILi16EEENSU_INS5_IJNSA_ILi64EEES18_EEENS5_IJSC_S28_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2C_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:27.3344984Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi4EEESB_NSA_ILi1EEEEEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEESJ_SK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E4M3_SS_TNILNSO_7ScaleInE1ELSQ_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESV_SV_EEEEENS5_IJNS4_10UnderscoreESY_SY_EEEEENS4_23SM90_TMA_LOAD_MULTICASTENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENST_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityES11_S1B_vS1C_EENS_8epilogue10collective18CollectiveEpilogueINS1E_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1L_NS1E_6fusion15Sm90TreeVisitorINS1N_11Sm90ComputeINS_4plusES1M_S1M_LNS_15FloatRoundStyleE2EvEEJNS1N_16Sm90ColBroadcastILi0ESI_S1M_S1M_NS5_IJSC_SV_SV_EEELi8ELb1EEENS1O_INS1P_INS_10multipliesES1M_fLS1R_2EvEEJNS1T_ILi0ESI_ffS1U_Li4ELb1EEENS1O_INS1P_IS1W_ffLS1R_2EvEEJNS1N_16Sm90RowBroadcastILi0ESI_ffNS5_IJSV_SC_SV_EEELi4ELb1EEENS1N_12Sm90AccFetchEEEEEEEEEENS4_13SM90_TMA_LOADENS12_IS14_NS15_ILi16EEENST_INS5_IJNSA_ILi64EEES17_EEENS5_IJSC_S29_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2D_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:27.3364262Z ptxas info : (C7511) Potential Performance Loss: wgmma.mma_async instructions are serialized due to insufficient register resources for the wgmma pipeline in the function '_ZN7cutlass13device_kernelINS_4gemm6kernel13GemmUniversalIN4cute5tupleIJiiiEEENS1_10collective13CollectiveMmaINS1_37MainloopSm90TmaGmmaWarpSpecializedFP8ILi4ENS5_IJNS4_1CILi4EEESB_NSA_ILi1EEEEEENS1_24KernelTmaWarpSpecializedEEENS5_IJNSA_ILi128EEENSA_ILi256EEESG_EEENS_12float_e4m3_tENS5_IJlSC_lEEENS_12float_e5m2_tESK_NS4_8TiledMMAINS4_8MMA_AtomIJNS4_4SM904GMMA31MMA_64x256x32_F32E4M3E5M2_SS_TNILNSP_7ScaleInE1ELSR_1EEEEEENS4_6LayoutINS5_IJSC_SC_SC_EEENS5_IJNSA_ILi0EEESW_SW_EEEEENS5_IJNS4_10UnderscoreESZ_SZ_EEEEENS4_23SM90_TMA_LOAD_MULTICASTENS4_14ComposedLayoutINS4_7SwizzleILi3ELi4ELi3EEENS4_18smem_ptr_flag_bitsILi8EEENSU_INS5_IJNSA_ILi8EEESG_EEENS5_IJSG_SC_EEEEEEEvNS4_8identityES12_S1C_vS1D_EENS_8epilogue10collective18CollectiveEpilogueINS1F_22Sm90TmaWarpSpecializedILi4ELi2ELi16ELb0ELb1EEEJSI_NS5_IJSG_NSA_ILi32EEEEEEvNS5_IJSC_llEEENS_10bfloat16_tES1M_NS1F_6fusion15Sm90TreeVisitorINS1O_11Sm90ComputeINS_4plusES1N_S1N_LNS_15FloatRoundStyleE2EvEEJNS1O_16Sm90ColBroadcastILi0ESI_S1N_S1N_NS5_IJSC_SW_SW_EEELi8ELb1EEENS1P_INS1Q_INS_10multipliesES1N_fLS1S_2EvEEJNS1U_ILi0ESI_ffS1V_Li4ELb1EEENS1P_INS1Q_IS1X_ffLS1S_2EvEEJNS1O_16Sm90RowBroadcastILi0ESI_ffNS5_IJSW_SC_SW_EEELi4ELb1EEENS1O_12Sm90AccFetchEEEEEEEEEENS4_13SM90_TMA_LOADENS13_IS15_NS16_ILi16EEENSU_INS5_IJNSA_ILi64EEES18_EEENS5_IJSC_S2A_EEEEEEENS4_17SM75_U16x8_LDSM_TENS4_14SM90_TMA_STOREES2E_NS4_17SM90_U16x8_STSM_TENS4_9Copy_AtomIJNS4_17SM90_U32x4_STSM_NENS_6half_tEEEEvEEEvvEEEEvNT_6ParamsE' 2025-05-07T20:17:52.4246577Z [127/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu.o 2025-05-07T20:17:52.4264806Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:17:52.4267168Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:52.4268900Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:17:52.4269538Z ^ 2025-05-07T20:17:52.4269796Z 2025-05-07T20:17:52.4270155Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:17:52.4270911Z 2025-05-07T20:17:52.4272354Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:17:52.4274170Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:17:52.4274815Z ^ 2025-05-07T20:17:52.4275052Z 2025-05-07T20:18:50.7060467Z [128/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu.o 2025-05-07T20:18:50.7079500Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:18:50.7081778Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:18:50.7083543Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:18:50.7084361Z ^ 2025-05-07T20:18:50.7084612Z 2025-05-07T20:18:50.7085507Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:18:50.7086021Z 2025-05-07T20:18:50.7087288Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:18:50.7089126Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:18:50.7089762Z ^ 2025-05-07T20:18:50.7090008Z 2025-05-07T20:18:57.7161018Z [129/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu.o 2025-05-07T20:18:57.7179471Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:18:57.7181706Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:18:57.7183647Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:18:57.7184266Z ^ 2025-05-07T20:18:57.7184520Z 2025-05-07T20:18:57.7185476Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:18:57.7185986Z 2025-05-07T20:18:57.7187241Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:18:57.7188964Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:18:57.7189596Z ^ 2025-05-07T20:18:57.7189831Z 2025-05-07T20:19:04.3221473Z [130/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu.o 2025-05-07T20:19:04.3240143Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:19:04.3242464Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:04.3244429Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:19:04.3245061Z ^ 2025-05-07T20:19:04.3245304Z 2025-05-07T20:19:04.3245685Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:19:04.3246207Z 2025-05-07T20:19:04.3247551Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:04.3249348Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:19:04.3249978Z ^ 2025-05-07T20:19:04.3250211Z 2025-05-07T20:19:13.9907602Z [131/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu.o 2025-05-07T20:19:13.9926666Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:19:13.9929060Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:13.9931148Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:19:13.9931822Z ^ 2025-05-07T20:19:13.9932069Z 2025-05-07T20:19:13.9932442Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:19:13.9932987Z 2025-05-07T20:19:13.9934209Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:13.9935962Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:19:13.9936625Z ^ 2025-05-07T20:19:13.9936884Z 2025-05-07T20:19:15.2583509Z [132/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu.o 2025-05-07T20:19:15.2595302Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:19:15.2596884Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:15.2597972Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:19:15.2598371Z ^ 2025-05-07T20:19:15.2598534Z 2025-05-07T20:19:15.2598766Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:19:15.2599091Z 2025-05-07T20:19:15.2599894Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:15.2600986Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:19:15.2601388Z ^ 2025-05-07T20:19:15.2601543Z 2025-05-07T20:19:23.0305016Z [133/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu.o 2025-05-07T20:19:23.0323878Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:19:23.0326640Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:23.0328475Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:19:23.0329137Z ^ 2025-05-07T20:19:23.0329385Z 2025-05-07T20:19:23.0329753Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:19:23.0330300Z 2025-05-07T20:19:23.0331609Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:23.0333456Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:19:23.0334112Z ^ 2025-05-07T20:19:23.0334353Z 2025-05-07T20:19:30.9397763Z [134/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu.o 2025-05-07T20:19:30.9415075Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:19:30.9417514Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:30.9419224Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:19:30.9419824Z ^ 2025-05-07T20:19:30.9420080Z 2025-05-07T20:19:30.9420417Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:19:30.9420916Z 2025-05-07T20:19:30.9422129Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:19:30.9423833Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:19:30.9424397Z ^ 2025-05-07T20:19:30.9424639Z 2025-05-07T20:21:22.0306880Z [135/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/i8i8bf16.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16.cu.o 2025-05-07T20:21:22.0324556Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:21:22.0327106Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:22.0328848Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:21:22.0329495Z ^ 2025-05-07T20:21:22.0329747Z 2025-05-07T20:21:22.0330112Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:21:22.0330641Z 2025-05-07T20:21:22.0331879Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:22.0333637Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:21:22.0334267Z ^ 2025-05-07T20:21:22.0334507Z 2025-05-07T20:21:29.6218417Z [136/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu.o 2025-05-07T20:21:29.6236134Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:21:29.6238505Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:29.6240478Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:21:29.6241123Z ^ 2025-05-07T20:21:29.6241384Z 2025-05-07T20:21:29.6241770Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:21:29.6242291Z 2025-05-07T20:21:29.6243584Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:29.6245368Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:21:29.6246002Z ^ 2025-05-07T20:21:29.6246255Z 2025-05-07T20:21:56.5390634Z [137/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu.o 2025-05-07T20:21:56.5408737Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:21:56.5411055Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:56.5413016Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:21:56.5413626Z ^ 2025-05-07T20:21:56.5413869Z 2025-05-07T20:21:56.5414206Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:21:56.5414713Z 2025-05-07T20:21:56.5415993Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:56.5417753Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:21:56.5418367Z ^ 2025-05-07T20:21:56.5418593Z 2025-05-07T20:21:57.7226768Z [138/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu.o 2025-05-07T20:21:57.7245236Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:21:57.7247617Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:57.7249570Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:21:57.7250190Z ^ 2025-05-07T20:21:57.7250429Z 2025-05-07T20:21:57.7250775Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:21:57.7251283Z 2025-05-07T20:21:57.7252551Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:21:57.7254352Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:21:57.7254973Z ^ 2025-05-07T20:21:57.7255223Z 2025-05-07T20:22:42.6792643Z [139/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu.o 2025-05-07T20:22:42.6809827Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:22:42.6812226Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:22:42.6814033Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:22:42.6814609Z ^ 2025-05-07T20:22:42.6814844Z 2025-05-07T20:22:42.6815174Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:22:42.6815665Z 2025-05-07T20:22:42.6816905Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:22:42.6818579Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:22:42.6819163Z ^ 2025-05-07T20:22:42.6819391Z 2025-05-07T20:22:50.3126579Z [140/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16_fast_gemv.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16_fast_gemv.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/bf16_fast_gemv.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16_fast_gemv.cu.o 2025-05-07T20:22:50.3143131Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:22:51.0928785Z [141/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_example_py_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_nccl.cpp.o -MF experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_nccl.cpp.o.d -o experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_nccl.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/example_nccl.cpp 2025-05-07T20:22:59.3689539Z [142/155] /opt/rh/gcc-toolset-11/root/usr/bin/c++ -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_example_py_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -std=c++20 -fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -mavx2 -mf16c -mfma -fopenmp -MD -MT experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_ops.cpp.o -MF experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_ops.cpp.o.d -o experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_ops.cpp.o -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/example_ops.cpp 2025-05-07T20:23:04.4000428Z [143/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/include/fast_gemv.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/include/fast_gemv.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/include/fast_gemv.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/include/fast_gemv.cu.o 2025-05-07T20:23:04.4011302Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:23:17.3694733Z [144/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu.o 2025-05-07T20:23:17.3712169Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:23:18.9475702Z [145/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu.o 2025-05-07T20:23:18.9494081Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:23:18.9496456Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:23:18.9498198Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:23:18.9498848Z ^ 2025-05-07T20:23:18.9499104Z 2025-05-07T20:23:18.9499455Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:23:18.9499990Z 2025-05-07T20:23:18.9501292Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:23:18.9503099Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:23:18.9503936Z ^ 2025-05-07T20:23:18.9504210Z 2025-05-07T20:23:37.6702260Z [146/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu.o 2025-05-07T20:23:37.6713374Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:23:46.9402685Z [147/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/mixed_dtype_utils.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/mixed_dtype_utils.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/mixed_dtype_utils.cu.o 2025-05-07T20:23:46.9420099Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:24:36.5376495Z [148/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_example_py_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/cutlass_sgemm_nn.cu.o -MF experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/cutlass_sgemm_nn.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/example/src/cutlass_sgemm_nn.cu -o experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/cutlass_sgemm_nn.cu.o 2025-05-07T20:24:36.5387299Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:24:37.0283464Z [149/155] : && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -fPIC -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -s -shared -Wl,-soname,fbgemm_gpu_experimental_example_py.so -o experimental/example/fbgemm_gpu_experimental_example_py.so experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_nccl.cpp.o experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/example_ops.cpp.o experimental/example/CMakeFiles/fbgemm_gpu_experimental_example_py.dir/src/cutlass_sgemm_nn.cu.o -L/lib/intel64 -L/lib/intel64_win -L/lib/win-x64 -L/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs -L/usr/local/cuda-12.8/targets/x86_64-linux/lib -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs: /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libnvrtc.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 /usr/lib64/libcuda.so /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so" -Wl,--as-needed /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libcudart.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so" -Wl,--as-needed -lcudadevrt -lcudart_static -lrt -lpthread -ldl && : 2025-05-07T20:24:37.0420668Z [150/155] cd /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/experimental/example && bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../.github/scripts/fbgemm_gpu_postbuild.bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:37.0422204Z ################################################################################ 2025-05-07T20:24:37.0422524Z [CMAKE] Running post-build script ... 2025-05-07T20:24:37.0423232Z Target file: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:37.0423940Z Removing all RPATHs ... 2025-05-07T20:24:37.0424202Z ################################################################################ 2025-05-07T20:24:42.7946580Z [151/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu.o 2025-05-07T20:24:42.7957716Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:24:42.7959261Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:24:42.7960355Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:24:42.7960752Z ^ 2025-05-07T20:24:42.7960917Z 2025-05-07T20:24:42.7961149Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:24:42.7961479Z 2025-05-07T20:24:42.7962281Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:24:42.7963375Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:24:42.7963779Z ^ 2025-05-07T20:24:42.7963935Z 2025-05-07T20:24:44.0612597Z [152/155] /usr/local/cuda-12.8/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_experimental_gen_ai_EXPORTS -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/asmjit/src -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cpuinfo/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/tools/util/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/composable_kernel/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/json/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -I/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include -isystem /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.8/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90a,code=sm_90a -gencode arch=compute_100a,code=sm_100a -gencode arch=compute_120a,code=sm_120a -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++20 -Xcompiler=-fPIC -Wno-deprecated-anon-enum-enum-conversion -Wno-deprecated-declarations -MD -MT experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu.o -MF experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu.o.d -x cu -c /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu -o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu.o 2025-05-07T20:24:44.0623662Z nvcc warning : Support for offline compilation for architectures prior to '_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:24:44.0625174Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_warpspecialized.hpp(719): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:24:44.0626267Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB_)); 2025-05-07T20:24:44.0626668Z ^ 2025-05-07T20:24:44.0626822Z 2025-05-07T20:24:44.0627052Z Remark: The warnings can be suppressed with "-diag-suppress " 2025-05-07T20:24:44.0627383Z 2025-05-07T20:24:44.0628176Z /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../external/cutlass/include/cutlass/gemm/collective/sm100_blockscaled_mma_array_warpspecialized.hpp(684): warning #2908-D: the implicit by-copy capture of "this" is deprecated 2025-05-07T20:24:44.0629273Z Tensor mSFB_tmp = observed_tma_load_sfb_->get_tma_tensor(shape(layout_SFB)); 2025-05-07T20:24:44.0629663Z ^ 2025-05-07T20:24:44.0629819Z 2025-05-07T20:24:44.9152662Z [153/155] : && /opt/rh/gcc-toolset-11/root/usr/bin/c++ -fPIC -DTORCH_USE_CUDA_DSA -DTORCH_USE_HIP_DSA -L/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib -DNO_AVX512=1 -O3 -DNDEBUG -s -shared -Wl,-soname,fbgemm_gpu_experimental_gen_ai.so -o experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/attention.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cpp.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/attention/gqa_attn_splitk.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/coalesce/coalesce.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/quantize.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/comm/car.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/gather_scatter/gather_scatter.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/moe/index_shuffling.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/kv_cache/kv_cache.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16bf16bf16_grouped.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_rowwise_batched.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/bf16i4bf16_shuffled_grouped.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_128_4_1_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_192_2_2_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_128_256_2_1_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_2_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_128_2_4_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_2_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_2_4_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_192_4_1_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_1_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_2_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_2_4_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f4f4bf16/f4f4bf16_256_256_4_1_1_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_blockwise.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_cublas.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_lite.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_128_128_2_1_1_t_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_2_1_1_f_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_128_256_128_4_4_1_f_t.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_128_128_1_1_1_f_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_16_128_1_1_1_f_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_1_1_1_f_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_256_128_2_1_1_f_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_32_128_2_1_1_f_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise/f8f8bf16_rowwise_64_64_128_2_1_1_f_f.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_cluster_size_and_transpose.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/dispatch_fp8_rowwise_batched_kernel_on_tile_size.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/f8f8bf16_rowwise_batched_impl.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_batched/handle_transposition.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_rowwise_grouped.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8f8bf16_tensorwise.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_rowwise.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/f8i4bf16_shuffled_grouped.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/i8i8bf16_dynamic.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/cutlass_extensions/mixed_dtype_utils.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16_fast_gemv.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/bf16fp8bf16_fast_gemv.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/fp8fp8bf16_fast_gemv.cu.o experimental/gen_ai/CMakeFiles/fbgemm_gpu_experimental_gen_ai.dir/src/quantize/fast_gemv/include/fast_gemv.cu.o -L/lib/intel64 -L/lib/intel64_win -L/lib/win-x64 -L/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs -L/usr/local/cuda-12.8/targets/x86_64-linux/lib -Wl,-rpath,/lib/intel64:/lib/intel64_win:/lib/win-x64:/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib:/usr/local/cuda-12.8/lib64:/usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs: /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libnvrtc.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/nvidia/nccl/lib/libnccl.so.2 /usr/lib64/libcuda.so /usr/local/cuda-12.8/targets/x86_64-linux/lib/stubs/libnvidia-ml.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so" -Wl,--as-needed /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10_cuda.so /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libc10.so /usr/local/cuda-12.8/lib64/libcudart.so -Wl,--no-as-needed,"/__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/lib/libtorch.so" -Wl,--as-needed -lcudadevrt -lcudart_static -lrt -lpthread -ldl && : 2025-05-07T20:24:45.1964457Z [154/155] cd /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai && bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/../.github/scripts/fbgemm_gpu_postbuild.bash /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:45.1965937Z ################################################################################ 2025-05-07T20:24:45.1966259Z [CMAKE] Running post-build script ... 2025-05-07T20:24:45.1966961Z Target file: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:45.1967656Z Removing all RPATHs ... 2025-05-07T20:24:45.1967912Z ################################################################################ 2025-05-07T20:24:45.1968832Z [154/155] cd /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-build && /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/cmake/data/bin/cmake -P cmake_install.cmake 2025-05-07T20:24:45.2033503Z -- Install configuration: "Release" 2025-05-07T20:24:45.2035605Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/asmjit.so 2025-05-07T20:24:45.2055642Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/fbgemm.so 2025-05-07T20:24:45.2061052Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:45.2074036Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.2075472Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/__init__.py 2025-05-07T20:24:45.2081690Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/ck_bf16_bench.py 2025-05-07T20:24:45.2083141Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/comm_bench.py 2025-05-07T20:24:45.2085394Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/gather_scatter_bench.py 2025-05-07T20:24:45.2087181Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/quantize_bench.py 2025-05-07T20:24:45.2088972Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/quantize_ops.py 2025-05-07T20:24:45.2090868Z -- Up-to-date: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.2091880Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/__init__.py 2025-05-07T20:24:45.2097509Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.2098633Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/README.md 2025-05-07T20:24:45.2104442Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/__init__.py 2025-05-07T20:24:45.2105657Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/activation.py 2025-05-07T20:24:45.2107407Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/gather_scatter.py 2025-05-07T20:24:45.2109131Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/layers.py 2025-05-07T20:24:45.2110823Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/shuffling.py 2025-05-07T20:24:45.2112591Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/quantize.py 2025-05-07T20:24:45.2118112Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:45.2137512Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/example/__init__.py 2025-05-07T20:24:45.2144787Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/example/utils.py 2025-05-07T20:24:45.2149188Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/__init__.py 2025-05-07T20:24:45.2150746Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py 2025-05-07T20:24:45.2152487Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py 2025-05-07T20:24:45.2154284Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/matmul_perf_model.py 2025-05-07T20:24:45.2155803Z -- Installing: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/_skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/utils.py 2025-05-07T20:24:45.2178744Z 2025-05-07T20:24:45.2239831Z 2025-05-07T20:24:45.2240300Z copying fbgemm_gpu/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/__init__.py 2025-05-07T20:24:45.2241105Z copying fbgemm_gpu/batched_unary_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/batched_unary_embeddings_ops.py 2025-05-07T20:24:45.2244455Z copying fbgemm_gpu/enums.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/enums.py 2025-05-07T20:24:45.2250155Z copying fbgemm_gpu/metrics.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/metrics.py 2025-05-07T20:24:45.2255771Z copying fbgemm_gpu/permute_pooled_embedding_modules.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/permute_pooled_embedding_modules.py 2025-05-07T20:24:45.2262255Z copying fbgemm_gpu/permute_pooled_embedding_modules_split.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/permute_pooled_embedding_modules_split.py 2025-05-07T20:24:45.2266891Z copying fbgemm_gpu/quantize_comm.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize_comm.py 2025-05-07T20:24:45.2276361Z copying fbgemm_gpu/quantize_utils.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize_utils.py 2025-05-07T20:24:45.2282708Z copying fbgemm_gpu/runtime_monitor.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/runtime_monitor.py 2025-05-07T20:24:45.2288931Z copying fbgemm_gpu/sparse_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sparse_ops.py 2025-05-07T20:24:45.2297139Z copying fbgemm_gpu/split_embedding_configs.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_configs.py 2025-05-07T20:24:45.2302520Z copying fbgemm_gpu/split_embedding_inference_converter.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_inference_converter.py 2025-05-07T20:24:45.2307723Z copying fbgemm_gpu/split_embedding_optimizer_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_optimizer_ops.py 2025-05-07T20:24:45.2312787Z copying fbgemm_gpu/split_embedding_utils.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_utils.py 2025-05-07T20:24:45.2324722Z copying fbgemm_gpu/split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops.py 2025-05-07T20:24:45.2340339Z copying fbgemm_gpu/split_table_batched_embeddings_ops_common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_common.py 2025-05-07T20:24:45.2345771Z copying fbgemm_gpu/split_table_batched_embeddings_ops_inference.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_inference.py 2025-05-07T20:24:45.2356107Z copying fbgemm_gpu/split_table_batched_embeddings_ops_training.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_training.py 2025-05-07T20:24:45.2374723Z copying fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py 2025-05-07T20:24:45.2379685Z copying fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py 2025-05-07T20:24:45.2384435Z copying fbgemm_gpu/tbe_input_multiplexer.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe_input_multiplexer.py 2025-05-07T20:24:45.2389697Z copying fbgemm_gpu/uvm.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/uvm.py 2025-05-07T20:24:45.2394990Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config 2025-05-07T20:24:45.2395725Z copying fbgemm_gpu/config/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config/__init__.py 2025-05-07T20:24:45.2400658Z copying fbgemm_gpu/config/feature_list.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config/feature_list.py 2025-05-07T20:24:45.2405334Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs 2025-05-07T20:24:45.2405949Z copying fbgemm_gpu/docs/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/__init__.py 2025-05-07T20:24:45.2410617Z copying fbgemm_gpu/docs/common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/common.py 2025-05-07T20:24:45.2415427Z copying fbgemm_gpu/docs/examples.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/examples.py 2025-05-07T20:24:45.2420010Z copying fbgemm_gpu/docs/jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/jagged_tensor_ops.py 2025-05-07T20:24:45.2425213Z copying fbgemm_gpu/docs/merge_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/merge_pooled_embedding_ops.py 2025-05-07T20:24:45.2436455Z copying fbgemm_gpu/docs/permute_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/permute_pooled_embedding_ops.py 2025-05-07T20:24:45.2441311Z copying fbgemm_gpu/docs/quantize_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/quantize_ops.py 2025-05-07T20:24:45.2446080Z copying fbgemm_gpu/docs/sparse_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/sparse_ops.py 2025-05-07T20:24:45.2452595Z copying fbgemm_gpu/docs/version.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/version.py 2025-05-07T20:24:45.2453783Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize 2025-05-07T20:24:45.2454463Z copying fbgemm_gpu/quantize/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize/__init__.py 2025-05-07T20:24:45.2459722Z copying fbgemm_gpu/quantize/quantize_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize/quantize_ops.py 2025-05-07T20:24:45.2472487Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll 2025-05-07T20:24:45.2473101Z copying fbgemm_gpu/sll/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/__init__.py 2025-05-07T20:24:45.2478876Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe 2025-05-07T20:24:45.2479481Z copying fbgemm_gpu/tbe/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/__init__.py 2025-05-07T20:24:45.2484241Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton 2025-05-07T20:24:45.2485010Z copying fbgemm_gpu/triton/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/__init__.py 2025-05-07T20:24:45.2492676Z copying fbgemm_gpu/triton/common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/common.py 2025-05-07T20:24:45.2497605Z copying fbgemm_gpu/triton/quantize.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/quantize.py 2025-05-07T20:24:45.2504191Z copying fbgemm_gpu/triton/quantize_ref.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/quantize_ref.py 2025-05-07T20:24:45.2510504Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils 2025-05-07T20:24:45.2511139Z copying fbgemm_gpu/utils/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/__init__.py 2025-05-07T20:24:45.2516787Z copying fbgemm_gpu/utils/filestore.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/filestore.py 2025-05-07T20:24:45.2521980Z copying fbgemm_gpu/utils/loader.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/loader.py 2025-05-07T20:24:45.2530924Z copying fbgemm_gpu/utils/torch_library.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/torch_library.py 2025-05-07T20:24:45.2536344Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.2537075Z copying fbgemm_gpu/sll/cpu/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu/__init__.py 2025-05-07T20:24:45.2543288Z copying fbgemm_gpu/sll/cpu/cpu_sll.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu/cpu_sll.py 2025-05-07T20:24:45.2549560Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.2550226Z copying fbgemm_gpu/sll/meta/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta/__init__.py 2025-05-07T20:24:45.2555985Z copying fbgemm_gpu/sll/meta/meta_sll.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta/meta_sll.py 2025-05-07T20:24:45.2562042Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.2562731Z copying fbgemm_gpu/sll/triton/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/__init__.py 2025-05-07T20:24:45.2567851Z copying fbgemm_gpu/sll/triton/common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/common.py 2025-05-07T20:24:45.2573110Z copying fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py 2025-05-07T20:24:45.2578311Z copying fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py 2025-05-07T20:24:45.2583464Z copying fbgemm_gpu/sll/triton/triton_jagged_bmm.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_bmm.py 2025-05-07T20:24:45.2590228Z copying fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py 2025-05-07T20:24:45.2596779Z copying fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py 2025-05-07T20:24:45.2601885Z copying fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py 2025-05-07T20:24:45.2615692Z copying fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py 2025-05-07T20:24:45.2621983Z copying fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py 2025-05-07T20:24:45.2628865Z copying fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py 2025-05-07T20:24:45.2634788Z copying fbgemm_gpu/sll/triton/triton_jagged_softmax.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_softmax.py 2025-05-07T20:24:45.2640939Z copying fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py 2025-05-07T20:24:45.2647238Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.2647924Z copying fbgemm_gpu/tbe/bench/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/__init__.py 2025-05-07T20:24:45.2653135Z copying fbgemm_gpu/tbe/bench/bench_config.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/bench_config.py 2025-05-07T20:24:45.2661476Z copying fbgemm_gpu/tbe/bench/bench_runs.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/bench_runs.py 2025-05-07T20:24:45.2668875Z copying fbgemm_gpu/tbe/bench/eeg_cli.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/eeg_cli.py 2025-05-07T20:24:45.2674150Z copying fbgemm_gpu/tbe/bench/embedding_ops_common_config.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/embedding_ops_common_config.py 2025-05-07T20:24:45.2679352Z copying fbgemm_gpu/tbe/bench/eval_compression.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/eval_compression.py 2025-05-07T20:24:45.2683823Z copying fbgemm_gpu/tbe/bench/reporter.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/reporter.py 2025-05-07T20:24:45.2689459Z copying fbgemm_gpu/tbe/bench/tbe_data_config.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config.py 2025-05-07T20:24:45.2695440Z copying fbgemm_gpu/tbe/bench/tbe_data_config_loader.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config_loader.py 2025-05-07T20:24:45.2700636Z copying fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py 2025-05-07T20:24:45.2707550Z copying fbgemm_gpu/tbe/bench/utils.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/utils.py 2025-05-07T20:24:45.2721211Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.2721894Z copying fbgemm_gpu/tbe/cache/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache/__init__.py 2025-05-07T20:24:45.2726716Z copying fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py 2025-05-07T20:24:45.2740104Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.2740756Z copying fbgemm_gpu/tbe/ssd/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/__init__.py 2025-05-07T20:24:45.2745551Z copying fbgemm_gpu/tbe/ssd/common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/common.py 2025-05-07T20:24:45.2750221Z copying fbgemm_gpu/tbe/ssd/inference.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/inference.py 2025-05-07T20:24:45.2756496Z copying fbgemm_gpu/tbe/ssd/training.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/training.py 2025-05-07T20:24:45.2775333Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.2776008Z copying fbgemm_gpu/tbe/stats/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats/__init__.py 2025-05-07T20:24:45.2782152Z copying fbgemm_gpu/tbe/stats/bench_params_reporter.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats/bench_params_reporter.py 2025-05-07T20:24:45.2788008Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.2788685Z copying fbgemm_gpu/tbe/utils/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/__init__.py 2025-05-07T20:24:45.2795611Z copying fbgemm_gpu/tbe/utils/common.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/common.py 2025-05-07T20:24:45.2801168Z copying fbgemm_gpu/tbe/utils/offsets.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/offsets.py 2025-05-07T20:24:45.2806010Z copying fbgemm_gpu/tbe/utils/quantize.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/quantize.py 2025-05-07T20:24:45.2811598Z copying fbgemm_gpu/tbe/utils/requests.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/requests.py 2025-05-07T20:24:45.2817928Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.2818651Z copying fbgemm_gpu/tbe/ssd/utils/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils/__init__.py 2025-05-07T20:24:45.2823981Z copying fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py 2025-05-07T20:24:45.2828879Z creating directory _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.2829611Z copying fbgemm_gpu/triton/jagged/__init__.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged/__init__.py 2025-05-07T20:24:45.2834391Z copying fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py 2025-05-07T20:24:45.2842654Z 2025-05-07T20:24:45.2962230Z INFO:root:running bdist_wheel 2025-05-07T20:24:45.3010873Z INFO:root:running build 2025-05-07T20:24:45.3011132Z INFO:root:running build_py 2025-05-07T20:24:45.3018082Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3020158Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3022601Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/batched_unary_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3024331Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/enums.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3026549Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/metrics.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3028858Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/permute_pooled_embedding_modules.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3030882Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/permute_pooled_embedding_modules_split.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3033185Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize_comm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3035862Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize_utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3037684Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/runtime_monitor.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3039935Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sparse_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3042550Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_configs.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3044551Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_inference_converter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3046842Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_optimizer_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3048944Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3051112Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3053090Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3055525Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_inference.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3058372Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_training.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3062464Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3064382Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3066556Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe_input_multiplexer.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3068593Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/uvm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3071306Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config 2025-05-07T20:24:45.3072733Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config 2025-05-07T20:24:45.3075318Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config/feature_list.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config 2025-05-07T20:24:45.3078512Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3079635Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3082031Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3084101Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/examples.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3086172Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3088718Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/merge_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3090664Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/permute_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3092890Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3094791Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/sparse_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3097464Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/version.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.3100031Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize 2025-05-07T20:24:45.3101512Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize 2025-05-07T20:24:45.3103907Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize 2025-05-07T20:24:45.3106312Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll 2025-05-07T20:24:45.3107663Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll 2025-05-07T20:24:45.3110368Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe 2025-05-07T20:24:45.3111703Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe 2025-05-07T20:24:45.3115005Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.3116177Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.3118663Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.3120373Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.3122864Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/quantize_ref.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.3125816Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.3127031Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.3129516Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/filestore.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.3131321Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/loader.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.3133777Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/torch_library.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.3136314Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.3137600Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.3140050Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu/cpu_sll.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.3142714Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.3144080Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.3146529Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta/meta_sll.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.3149988Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3151382Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3154068Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3156011Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3158116Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3160217Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_bmm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3162613Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3165185Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3167196Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3169348Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3171908Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3174041Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3176092Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_softmax.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3179284Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.3182777Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3184234Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3186575Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/bench_config.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3189127Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/bench_runs.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3191319Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/eeg_cli.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3193636Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/embedding_ops_common_config.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3195773Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/eval_compression.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3197825Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/reporter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3199912Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3202655Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config_loader.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3204542Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3206866Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.3209343Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.3210739Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.3213331Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.3215776Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.3217091Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.3219445Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.3221447Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/inference.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.3223868Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/training.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.3227306Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.3228659Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.3231114Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats/bench_params_reporter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.3234058Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.3235306Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.3237684Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.3239628Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/offsets.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.3241955Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.3244113Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/requests.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.3246974Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.3248310Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.3250864Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.3253272Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.3254698Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.3257089Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.3313416Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/asmjit.so -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3348429Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/fbgemm.so -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.3645484Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.3646930Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.6829812Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6830891Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6840563Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/ck_bf16_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6850656Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/comm_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6861252Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/gather_scatter_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6871297Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/quantize_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6882115Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/bench/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.6903291Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.6913798Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6915002Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/README.md -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6924484Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6933580Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/activation.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6942934Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/gather_scatter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6953817Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/layers.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6974540Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/moe/shuffling.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.6984292Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gen_ai/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.6995126Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example 2025-05-07T20:24:45.6996365Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/example/fbgemm_gpu_experimental_example_py.so -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example 2025-05-07T20:24:45.7030993Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/example/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example 2025-05-07T20:24:45.7040076Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/example/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example 2025-05-07T20:24:45.7047656Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.7049285Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.7058307Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.7078188Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.7098911Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/matmul_perf_model.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.7109533Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/experimental/gemm/triton_gemm/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.7118014Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7120455Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/batched_unary_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7122439Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/enums.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7124941Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/metrics.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7127539Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/permute_pooled_embedding_modules.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7130059Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/permute_pooled_embedding_modules_split.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7132282Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize_comm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7134783Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize_utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7137268Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/runtime_monitor.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7139858Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sparse_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7142737Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_configs.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7145111Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_inference_converter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7147556Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_optimizer_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7150014Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_embedding_utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7152540Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7155037Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7157360Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_inference.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7160707Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_training.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7165212Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7167634Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7170025Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe_input_multiplexer.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7172324Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/uvm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu 2025-05-07T20:24:45.7174785Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config 2025-05-07T20:24:45.7177215Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/config/feature_list.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config 2025-05-07T20:24:45.7179595Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7182014Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7184445Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/examples.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7187331Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7189708Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/merge_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7192782Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/permute_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7195830Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7198104Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/sparse_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7200770Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/docs/version.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs 2025-05-07T20:24:45.7203020Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize 2025-05-07T20:24:45.7207214Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/quantize/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize 2025-05-07T20:24:45.7209150Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll 2025-05-07T20:24:45.7211874Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe 2025-05-07T20:24:45.7214005Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.7215927Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.7218212Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.7220622Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/quantize_ref.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton 2025-05-07T20:24:45.7222612Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.7224837Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/filestore.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.7226724Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/loader.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.7229030Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/utils/torch_library.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils 2025-05-07T20:24:45.7230931Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.7233429Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/cpu/cpu_sll.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.7235907Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.7244504Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/meta/meta_sll.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.7245753Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7247001Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7248353Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7249792Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7251248Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_bmm.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7252616Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7254106Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7255627Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7257849Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7260500Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7262741Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7264787Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_jagged_softmax.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7267323Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7269337Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7271735Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/bench_config.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7274122Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/bench_runs.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7276455Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/eeg_cli.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7278570Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/embedding_ops_common_config.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7281239Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/eval_compression.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7287096Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/reporter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7289331Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7291319Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config_loader.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7293629Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7295826Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/bench/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7298083Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.7300358Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.7302274Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7304596Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7306896Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/inference.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7309230Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/training.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7312435Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.7314749Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/stats/bench_params_reporter.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.7316664Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7318979Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/common.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7320896Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/offsets.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7323190Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7325162Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/utils/requests.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7327672Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.7329983Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.7331927Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.7334252Z INFO:root:copying _skbuild/linux-x86_64-3.9/cmake-install/fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.7354259Z INFO:skbuild:copied 90 files 2025-05-07T20:24:45.7354874Z INFO:root:running build_ext 2025-05-07T20:24:45.7358307Z INFO:root:installing to _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel 2025-05-07T20:24:45.7359164Z INFO:root:running install 2025-05-07T20:24:45.7412703Z INFO:root:running install_lib 2025-05-07T20:24:45.7415212Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel 2025-05-07T20:24:45.7417006Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu 2025-05-07T20:24:45.7418698Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/config 2025-05-07T20:24:45.7419714Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/config 2025-05-07T20:24:45.7421351Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/config/feature_list.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/config 2025-05-07T20:24:45.7422909Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/docs 2025-05-07T20:24:45.7423996Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7425671Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7427008Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/examples.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7428702Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7430133Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/merge_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7431705Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/permute_pooled_embedding_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7433235Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7434831Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/sparse_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7436726Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/docs/version.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/docs 2025-05-07T20:24:45.7438295Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/quantize 2025-05-07T20:24:45.7439805Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/quantize 2025-05-07T20:24:45.7441354Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/quantize 2025-05-07T20:24:45.7442765Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/sll 2025-05-07T20:24:45.7444287Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.7445736Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.7447300Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/cpu/cpu_sll.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/cpu 2025-05-07T20:24:45.7448930Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/sll/meta 2025-05-07T20:24:45.7450944Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/meta 2025-05-07T20:24:45.7452326Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/meta/meta_sll.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/meta 2025-05-07T20:24:45.7453737Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7455497Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7457124Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7458642Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7460250Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7461782Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_bmm.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7463482Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7465093Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7466819Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7468483Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7470112Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7471748Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7473565Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_jagged_softmax.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7475164Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll/triton 2025-05-07T20:24:45.7476688Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sll/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/sll 2025-05-07T20:24:45.7478106Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe 2025-05-07T20:24:45.7479868Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7481400Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7482810Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/bench_config.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7484431Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/bench_runs.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7486319Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/eeg_cli.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7490576Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/embedding_ops_common_config.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7492183Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/eval_compression.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7493640Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/reporter.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7495145Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/tbe_data_config.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7496910Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/tbe_data_config_loader.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7498524Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7500103Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/bench/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/bench 2025-05-07T20:24:45.7501671Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.7503571Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.7505028Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/cache 2025-05-07T20:24:45.7506409Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7508078Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.7509397Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.7511153Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/ssd/utils 2025-05-07T20:24:45.7512908Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7514286Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7515874Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/inference.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7517510Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/ssd/training.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/ssd 2025-05-07T20:24:45.7519872Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.7521254Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.7522814Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/stats/bench_params_reporter.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/stats 2025-05-07T20:24:45.7524384Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7525849Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7527526Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils/common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7529171Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils/offsets.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7530597Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7532013Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/utils/requests.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe/utils 2025-05-07T20:24:45.7533792Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/tbe 2025-05-07T20:24:45.7535296Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/triton 2025-05-07T20:24:45.7536742Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.7538247Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.7539944Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/triton/jagged 2025-05-07T20:24:45.7541552Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/triton 2025-05-07T20:24:45.7543076Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/triton 2025-05-07T20:24:45.7544590Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/triton 2025-05-07T20:24:45.7546432Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/triton/quantize_ref.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/triton 2025-05-07T20:24:45.7548570Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/utils 2025-05-07T20:24:45.7549971Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/utils 2025-05-07T20:24:45.7551321Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils/filestore.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/utils 2025-05-07T20:24:45.7553026Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils/loader.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/utils 2025-05-07T20:24:45.7554633Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/utils/torch_library.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/utils 2025-05-07T20:24:45.7556158Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/asmjit.so -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.7562001Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/fbgemm.so -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.7609998Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental 2025-05-07T20:24:45.7611191Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.7612850Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.8425872Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.8427064Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8428322Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe/README.md -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8430061Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8431861Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe/activation.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8433546Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe/gather_scatter.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8435187Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe/layers.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8436804Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/moe/shuffling.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai/moe 2025-05-07T20:24:45.8438509Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gen_ai/quantize.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gen_ai 2025-05-07T20:24:45.8439758Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8441196Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8442784Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench/ck_bf16_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8444445Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench/comm_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8446072Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench/gather_scatter_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8447753Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench/quantize_bench.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8449528Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/bench/quantize_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/bench 2025-05-07T20:24:45.8451334Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental/example 2025-05-07T20:24:45.8453072Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example/fbgemm_gpu_experimental_example_py.so -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/example 2025-05-07T20:24:45.8458822Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/example 2025-05-07T20:24:45.8460469Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/example/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/example 2025-05-07T20:24:45.8461757Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental/gemm 2025-05-07T20:24:45.8463142Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.8464840Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.8466538Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.8469514Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.8471379Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm/matmul_perf_model.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.8473316Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/experimental/gemm/triton_gemm/utils.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu/experimental/gemm/triton_gemm 2025-05-07T20:24:45.8474777Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/__init__.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8476210Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/batched_unary_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8477817Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/enums.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8479360Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/metrics.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8480981Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/permute_pooled_embedding_modules.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8482511Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/permute_pooled_embedding_modules_split.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8483884Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize_comm.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8485832Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/quantize_utils.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8487445Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/runtime_monitor.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8488970Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/sparse_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8490877Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_embedding_configs.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8492537Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_embedding_inference_converter.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8493966Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_embedding_optimizer_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8495540Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_embedding_utils.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8497050Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8498515Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_table_batched_embeddings_ops_common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8500205Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_table_batched_embeddings_ops_inference.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8502450Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_table_batched_embeddings_ops_training.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8505569Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8507127Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8508537Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/tbe_input_multiplexer.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8509820Z INFO:root:copying _skbuild/linux-x86_64-3.9/setuptools/lib.linux-x86_64-cpython-39/fbgemm_gpu/uvm.py -> _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu 2025-05-07T20:24:45.8510772Z INFO:skbuild:copied 115 files 2025-05-07T20:24:45.8511035Z INFO:root:running install_egg_info 2025-05-07T20:24:45.8577534Z INFO:root:running egg_info 2025-05-07T20:24:45.8610704Z INFO:root:creating fbgemm_gpu_genai.egg-info 2025-05-07T20:24:45.8611772Z INFO:root:writing fbgemm_gpu_genai.egg-info/PKG-INFO 2025-05-07T20:24:45.8616677Z INFO:root:writing dependency_links to fbgemm_gpu_genai.egg-info/dependency_links.txt 2025-05-07T20:24:45.8619212Z INFO:root:writing requirements to fbgemm_gpu_genai.egg-info/requires.txt 2025-05-07T20:24:45.8620369Z INFO:root:writing top-level names to fbgemm_gpu_genai.egg-info/top_level.txt 2025-05-07T20:24:45.8622166Z INFO:root:writing manifest file 'fbgemm_gpu_genai.egg-info/SOURCES.txt' 2025-05-07T20:24:45.8735056Z INFO:root:reading manifest file 'fbgemm_gpu_genai.egg-info/SOURCES.txt' 2025-05-07T20:24:45.8761816Z INFO:root:writing manifest file 'fbgemm_gpu_genai.egg-info/SOURCES.txt' 2025-05-07T20:24:45.8762932Z INFO:root:Copying fbgemm_gpu_genai.egg-info to _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/./fbgemm_gpu_genai-2025.5.7+cu128-py3.9.egg-info 2025-05-07T20:24:45.8772410Z INFO:root:running install_scripts 2025-05-07T20:24:45.8772678Z INFO:skbuild:copied 0 files 2025-05-07T20:24:48.9901618Z INFO:root:creating _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel/fbgemm_gpu_genai-2025.5.7+cu128.dist-info/WHEEL 2025-05-07T20:24:48.9905869Z INFO:wheel:creating '/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/dist/.tmp-4fpfzjvl/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl' and adding '_skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel' to it 2025-05-07T20:24:48.9909523Z INFO:wheel:adding 'fbgemm_gpu/__init__.py' 2025-05-07T20:24:49.0185166Z INFO:wheel:adding 'fbgemm_gpu/asmjit.so' 2025-05-07T20:24:49.0191131Z INFO:wheel:adding 'fbgemm_gpu/batched_unary_embeddings_ops.py' 2025-05-07T20:24:49.0193526Z INFO:wheel:adding 'fbgemm_gpu/enums.py' 2025-05-07T20:24:49.1836638Z INFO:wheel:adding 'fbgemm_gpu/fbgemm.so' 2025-05-07T20:24:49.1877223Z INFO:wheel:adding 'fbgemm_gpu/metrics.py' 2025-05-07T20:24:49.1880062Z INFO:wheel:adding 'fbgemm_gpu/permute_pooled_embedding_modules.py' 2025-05-07T20:24:49.1882403Z INFO:wheel:adding 'fbgemm_gpu/permute_pooled_embedding_modules_split.py' 2025-05-07T20:24:49.1885894Z INFO:wheel:adding 'fbgemm_gpu/quantize_comm.py' 2025-05-07T20:24:49.1889815Z INFO:wheel:adding 'fbgemm_gpu/quantize_utils.py' 2025-05-07T20:24:49.1893114Z INFO:wheel:adding 'fbgemm_gpu/runtime_monitor.py' 2025-05-07T20:24:49.1904499Z INFO:wheel:adding 'fbgemm_gpu/sparse_ops.py' 2025-05-07T20:24:49.1908056Z INFO:wheel:adding 'fbgemm_gpu/split_embedding_configs.py' 2025-05-07T20:24:49.1911651Z INFO:wheel:adding 'fbgemm_gpu/split_embedding_inference_converter.py' 2025-05-07T20:24:49.1913369Z INFO:wheel:adding 'fbgemm_gpu/split_embedding_optimizer_ops.py' 2025-05-07T20:24:49.1915614Z INFO:wheel:adding 'fbgemm_gpu/split_embedding_utils.py' 2025-05-07T20:24:49.1917984Z INFO:wheel:adding 'fbgemm_gpu/split_table_batched_embeddings_ops.py' 2025-05-07T20:24:49.1921561Z INFO:wheel:adding 'fbgemm_gpu/split_table_batched_embeddings_ops_common.py' 2025-05-07T20:24:49.1941836Z INFO:wheel:adding 'fbgemm_gpu/split_table_batched_embeddings_ops_inference.py' 2025-05-07T20:24:49.1983174Z INFO:wheel:adding 'fbgemm_gpu/split_table_batched_embeddings_ops_training.py' 2025-05-07T20:24:49.1988663Z INFO:wheel:adding 'fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py' 2025-05-07T20:24:49.1990349Z INFO:wheel:adding 'fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py' 2025-05-07T20:24:49.1993405Z INFO:wheel:adding 'fbgemm_gpu/tbe_input_multiplexer.py' 2025-05-07T20:24:49.1994707Z INFO:wheel:adding 'fbgemm_gpu/uvm.py' 2025-05-07T20:24:49.1996470Z INFO:wheel:adding 'fbgemm_gpu/config/__init__.py' 2025-05-07T20:24:49.1998685Z INFO:wheel:adding 'fbgemm_gpu/config/feature_list.py' 2025-05-07T20:24:49.2001107Z INFO:wheel:adding 'fbgemm_gpu/docs/__init__.py' 2025-05-07T20:24:49.2003002Z INFO:wheel:adding 'fbgemm_gpu/docs/common.py' 2025-05-07T20:24:49.2005343Z INFO:wheel:adding 'fbgemm_gpu/docs/examples.py' 2025-05-07T20:24:49.2008219Z INFO:wheel:adding 'fbgemm_gpu/docs/jagged_tensor_ops.py' 2025-05-07T20:24:49.2010306Z INFO:wheel:adding 'fbgemm_gpu/docs/merge_pooled_embedding_ops.py' 2025-05-07T20:24:49.2012968Z INFO:wheel:adding 'fbgemm_gpu/docs/permute_pooled_embedding_ops.py' 2025-05-07T20:24:49.2015257Z INFO:wheel:adding 'fbgemm_gpu/docs/quantize_ops.py' 2025-05-07T20:24:49.2021390Z INFO:wheel:adding 'fbgemm_gpu/docs/sparse_ops.py' 2025-05-07T20:24:49.2023497Z INFO:wheel:adding 'fbgemm_gpu/docs/version.py' 2025-05-07T20:24:49.2025959Z INFO:wheel:adding 'fbgemm_gpu/experimental/bench/__init__.py' 2025-05-07T20:24:49.2028713Z INFO:wheel:adding 'fbgemm_gpu/experimental/bench/ck_bf16_bench.py' 2025-05-07T20:24:49.2032521Z INFO:wheel:adding 'fbgemm_gpu/experimental/bench/comm_bench.py' 2025-05-07T20:24:49.2036735Z INFO:wheel:adding 'fbgemm_gpu/experimental/bench/gather_scatter_bench.py' 2025-05-07T20:24:49.2042622Z INFO:wheel:adding 'fbgemm_gpu/experimental/bench/quantize_bench.py' 2025-05-07T20:24:49.2054358Z INFO:wheel:adding 'fbgemm_gpu/experimental/bench/quantize_ops.py' 2025-05-07T20:24:49.2057038Z INFO:wheel:adding 'fbgemm_gpu/experimental/example/__init__.py' 2025-05-07T20:24:49.2210104Z INFO:wheel:adding 'fbgemm_gpu/experimental/example/fbgemm_gpu_experimental_example_py.so' 2025-05-07T20:24:49.2214533Z INFO:wheel:adding 'fbgemm_gpu/experimental/example/utils.py' 2025-05-07T20:24:49.2217055Z INFO:wheel:adding 'fbgemm_gpu/experimental/gemm/triton_gemm/__init__.py' 2025-05-07T20:24:49.2243177Z INFO:wheel:adding 'fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py' 2025-05-07T20:24:49.2252530Z INFO:wheel:adding 'fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py' 2025-05-07T20:24:49.2256280Z INFO:wheel:adding 'fbgemm_gpu/experimental/gemm/triton_gemm/matmul_perf_model.py' 2025-05-07T20:24:49.2258831Z INFO:wheel:adding 'fbgemm_gpu/experimental/gemm/triton_gemm/utils.py' 2025-05-07T20:24:49.2261172Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/__init__.py' 2025-05-07T20:24:51.2596436Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so' 2025-05-07T20:24:51.3298941Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/quantize.py' 2025-05-07T20:24:51.3299965Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/moe/README.md' 2025-05-07T20:24:51.3301778Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/moe/__init__.py' 2025-05-07T20:24:51.3304704Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/moe/activation.py' 2025-05-07T20:24:51.3308988Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/moe/gather_scatter.py' 2025-05-07T20:24:51.3318663Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/moe/layers.py' 2025-05-07T20:24:51.3322551Z INFO:wheel:adding 'fbgemm_gpu/experimental/gen_ai/moe/shuffling.py' 2025-05-07T20:24:51.3324312Z INFO:wheel:adding 'fbgemm_gpu/quantize/__init__.py' 2025-05-07T20:24:51.3326401Z INFO:wheel:adding 'fbgemm_gpu/quantize/quantize_ops.py' 2025-05-07T20:24:51.3328511Z INFO:wheel:adding 'fbgemm_gpu/sll/__init__.py' 2025-05-07T20:24:51.3330699Z INFO:wheel:adding 'fbgemm_gpu/sll/cpu/__init__.py' 2025-05-07T20:24:51.3336979Z INFO:wheel:adding 'fbgemm_gpu/sll/cpu/cpu_sll.py' 2025-05-07T20:24:51.3339144Z INFO:wheel:adding 'fbgemm_gpu/sll/meta/__init__.py' 2025-05-07T20:24:51.3341739Z INFO:wheel:adding 'fbgemm_gpu/sll/meta/meta_sll.py' 2025-05-07T20:24:51.3344317Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/__init__.py' 2025-05-07T20:24:51.3346624Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/common.py' 2025-05-07T20:24:51.3348251Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py' 2025-05-07T20:24:51.3350828Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py' 2025-05-07T20:24:51.3354719Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_bmm.py' 2025-05-07T20:24:51.3358604Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py' 2025-05-07T20:24:51.3360567Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py' 2025-05-07T20:24:51.3362927Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py' 2025-05-07T20:24:51.3368373Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py' 2025-05-07T20:24:51.3373420Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py' 2025-05-07T20:24:51.3375586Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py' 2025-05-07T20:24:51.3379398Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_jagged_softmax.py' 2025-05-07T20:24:51.3384561Z INFO:wheel:adding 'fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py' 2025-05-07T20:24:51.3386869Z INFO:wheel:adding 'fbgemm_gpu/tbe/__init__.py' 2025-05-07T20:24:51.3389002Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/__init__.py' 2025-05-07T20:24:51.3391241Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/bench_config.py' 2025-05-07T20:24:51.3396562Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/bench_runs.py' 2025-05-07T20:24:51.3398690Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/eeg_cli.py' 2025-05-07T20:24:51.3401107Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/embedding_ops_common_config.py' 2025-05-07T20:24:51.3403119Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/eval_compression.py' 2025-05-07T20:24:51.3404947Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/reporter.py' 2025-05-07T20:24:51.3408287Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/tbe_data_config.py' 2025-05-07T20:24:51.3410885Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/tbe_data_config_loader.py' 2025-05-07T20:24:51.3413472Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py' 2025-05-07T20:24:51.3415239Z INFO:wheel:adding 'fbgemm_gpu/tbe/bench/utils.py' 2025-05-07T20:24:51.3417093Z INFO:wheel:adding 'fbgemm_gpu/tbe/cache/__init__.py' 2025-05-07T20:24:51.3418910Z INFO:wheel:adding 'fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py' 2025-05-07T20:24:51.3420728Z INFO:wheel:adding 'fbgemm_gpu/tbe/ssd/__init__.py' 2025-05-07T20:24:51.3422486Z INFO:wheel:adding 'fbgemm_gpu/tbe/ssd/common.py' 2025-05-07T20:24:51.3428379Z INFO:wheel:adding 'fbgemm_gpu/tbe/ssd/inference.py' 2025-05-07T20:24:51.3452503Z INFO:wheel:adding 'fbgemm_gpu/tbe/ssd/training.py' 2025-05-07T20:24:51.3454729Z INFO:wheel:adding 'fbgemm_gpu/tbe/ssd/utils/__init__.py' 2025-05-07T20:24:51.3458010Z INFO:wheel:adding 'fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py' 2025-05-07T20:24:51.3459727Z INFO:wheel:adding 'fbgemm_gpu/tbe/stats/__init__.py' 2025-05-07T20:24:51.3462579Z INFO:wheel:adding 'fbgemm_gpu/tbe/stats/bench_params_reporter.py' 2025-05-07T20:24:51.3464387Z INFO:wheel:adding 'fbgemm_gpu/tbe/utils/__init__.py' 2025-05-07T20:24:51.3466268Z INFO:wheel:adding 'fbgemm_gpu/tbe/utils/common.py' 2025-05-07T20:24:51.3468145Z INFO:wheel:adding 'fbgemm_gpu/tbe/utils/offsets.py' 2025-05-07T20:24:51.3470770Z INFO:wheel:adding 'fbgemm_gpu/tbe/utils/quantize.py' 2025-05-07T20:24:51.3476411Z INFO:wheel:adding 'fbgemm_gpu/tbe/utils/requests.py' 2025-05-07T20:24:51.3478453Z INFO:wheel:adding 'fbgemm_gpu/triton/__init__.py' 2025-05-07T20:24:51.3480153Z INFO:wheel:adding 'fbgemm_gpu/triton/common.py' 2025-05-07T20:24:51.3488239Z INFO:wheel:adding 'fbgemm_gpu/triton/quantize.py' 2025-05-07T20:24:51.3492464Z INFO:wheel:adding 'fbgemm_gpu/triton/quantize_ref.py' 2025-05-07T20:24:51.3494362Z INFO:wheel:adding 'fbgemm_gpu/triton/jagged/__init__.py' 2025-05-07T20:24:51.3502292Z INFO:wheel:adding 'fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py' 2025-05-07T20:24:51.3504308Z INFO:wheel:adding 'fbgemm_gpu/utils/__init__.py' 2025-05-07T20:24:51.3506726Z INFO:wheel:adding 'fbgemm_gpu/utils/filestore.py' 2025-05-07T20:24:51.3508459Z INFO:wheel:adding 'fbgemm_gpu/utils/loader.py' 2025-05-07T20:24:51.3510736Z INFO:wheel:adding 'fbgemm_gpu/utils/torch_library.py' 2025-05-07T20:24:51.3514065Z INFO:wheel:adding 'fbgemm_gpu_genai-2025.5.7+cu128.dist-info/METADATA' 2025-05-07T20:24:51.3515709Z INFO:wheel:adding 'fbgemm_gpu_genai-2025.5.7+cu128.dist-info/WHEEL' 2025-05-07T20:24:51.3516215Z INFO:wheel:adding 'fbgemm_gpu_genai-2025.5.7+cu128.dist-info/top_level.txt' 2025-05-07T20:24:51.3522238Z INFO:wheel:adding 'fbgemm_gpu_genai-2025.5.7+cu128.dist-info/RECORD' 2025-05-07T20:24:51.3526317Z INFO:root:removing _skbuild/linux-x86_64-3.9/setuptools/bdist.linux-x86_64/wheel 2025-05-07T20:24:51.3789134Z ╒════════════════════════════╤════════════════════════════════════════════════╕ 2025-05-07T20:24:51.3789597Z │ │ Version │ 2025-05-07T20:24:51.3790054Z ╞════════════════════════════╪════════════════════════════════════════════════╡ 2025-05-07T20:24:51.3790500Z │ PyTorch │ 2.8.0.dev20250507+cu128 │ 2025-05-07T20:24:51.3790995Z ├────────────────────────────┼────────────────────────────────────────────────┤ 2025-05-07T20:24:51.3791463Z │ CUDA (Declared by PyTorch) │ 12.8 │ 2025-05-07T20:24:51.3792100Z ├────────────────────────────┼────────────────────────────────────────────────┤ 2025-05-07T20:24:51.3792616Z │ CUDA (Actual) │ nvcc: NVIDIA (R) Cuda compiler driver │ 2025-05-07T20:24:51.3793131Z │ │ Copyright (c) 2005-2025 NVIDIA Corporation │ 2025-05-07T20:24:51.3793715Z │ │ Built on Wed_Jan_15_19:20:09_PST_2025 │ 2025-05-07T20:24:51.3794147Z │ │ Cuda compilation tools, release 12.8, V12.8.61 │ 2025-05-07T20:24:51.3794594Z │ │ Build cuda_12.8.r12.8/compiler.35404655_0 │ 2025-05-07T20:24:51.3795111Z ╘════════════════════════════╧════════════════════════════════════════════════╛ 2025-05-07T20:24:51.7112346Z Successfully built fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:51.7680970Z 2025-05-07T20:24:51.7886643Z ################################################################################ 2025-05-07T20:24:51.7887435Z [CHECK] BUILT LIBRARY: ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.7887980Z [CHECK] Listing out library size: 2025-05-07T20:24:51.7888517Z + du -h --block-size=1M ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.7889035Z 2025-05-07T20:24:51.7940997Z 91 ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.7943457Z 2025-05-07T20:24:51.7944431Z [CHECK] Listing out the GLIBC versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.7945559Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so | grep GLIBC_ | sed 's/.*GLIBC_\([.0-9]*\).*/GLIBC_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.7946211Z 2025-05-07T20:24:51.8103820Z GLIBC_2.2.5 2025-05-07T20:24:51.8110823Z GLIBC_2.3 2025-05-07T20:24:51.8111082Z GLIBC_2.4 2025-05-07T20:24:51.8111529Z GLIBC_2.14 2025-05-07T20:24:51.8111699Z 2025-05-07T20:24:51.8111705Z 2025-05-07T20:24:51.8112510Z [CHECK] Listing out the GLIBCXX versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.8113779Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so | grep GLIBCXX_ | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.8114451Z 2025-05-07T20:24:51.8218220Z GLIBCXX_3.4 2025-05-07T20:24:51.8218498Z GLIBCXX_3.4.9 2025-05-07T20:24:51.8218771Z GLIBCXX_3.4.11 2025-05-07T20:24:51.8219038Z GLIBCXX_3.4.18 2025-05-07T20:24:51.8219292Z GLIBCXX_3.4.20 2025-05-07T20:24:51.8219545Z GLIBCXX_3.4.21 2025-05-07T20:24:51.8221644Z 2025-05-07T20:24:51.8221777Z 2025-05-07T20:24:51.8261430Z + nm -gDC ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so > /tmp/tmp.WBrts6tHFV.symbols.txt 2025-05-07T20:24:51.8262734Z 2025-05-07T20:24:51.8437106Z 2025-05-07T20:24:51.8501898Z [CHECK] Total Number of symbols: 1953 2025-05-07T20:24:51.8525699Z [CHECK] Number of fbgemm symbols: 619 2025-05-07T20:24:51.8556668Z + nm -gDCu ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so > /tmp/tmp.NayCG6BOn7.usymbols.txt 2025-05-07T20:24:51.8557381Z 2025-05-07T20:24:51.8593806Z 2025-05-07T20:24:51.8633413Z [CHECK] Listing out undefined symbols (281 total): 2025-05-07T20:24:51.8668274Z U __assert_fail@GLIBC_2.2.5 2025-05-07T20:24:51.8668666Z U at::cuda::detail::getDefaultCUDAGenerator(signed char) 2025-05-07T20:24:51.8669137Z U at::CUDAGeneratorImpl::device_type() 2025-05-07T20:24:51.8669640Z U at::CUDAGeneratorImpl::philox_cuda_state(unsigned long) 2025-05-07T20:24:51.8670116Z U at::cuda::getCurrentDeviceProperties() 2025-05-07T20:24:51.8670572Z U at::_ops::add__Tensor::call(at::Tensor&, at::Tensor const&, c10::Scalar const&) 2025-05-07T20:24:51.8671055Z U at::_ops::div__Scalar::call(at::Tensor&, c10::Scalar const&) 2025-05-07T20:24:51.8671978Z U at::_ops::empty_like::call(at::Tensor const&, std::optional, std::optional, std::optional, std::optional, std::optional) 2025-05-07T20:24:51.8674503Z U at::_ops::empty_memory_format::call(c10::ArrayRef, std::optional, std::optional, std::optional, std::optional, std::optional) 2025-05-07T20:24:51.8675416Z U at::_ops::expand::call(at::Tensor const&, c10::ArrayRef, bool) 2025-05-07T20:24:51.8676027Z U at::_ops::index_select::call(at::Tensor const&, long, at::Tensor const&) 2025-05-07T20:24:51.8676498Z U at::_ops::norm_Scalar::call(at::Tensor const&, c10::Scalar const&) 2025-05-07T20:24:51.8677000Z U at::_ops::scatter_add_::call(at::Tensor&, long, at::Tensor const&, at::Tensor const&) 2025-05-07T20:24:51.8677483Z U at::_ops::select_int::call(at::Tensor const&, long, c10::SymInt) 2025-05-07T20:24:51.8677967Z U at::_ops::split_sizes::call(at::Tensor const&, c10::ArrayRef, long) 2025-05-07T20:24:51.8678634Z U at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef, bool, std::optional) 2025-05-07T20:24:51.8679341Z U at::_ops::to_dtype::call(at::Tensor const&, c10::ScalarType, bool, bool, std::optional) 2025-05-07T20:24:51.8680301Z U at::_ops::to_dtype_layout::call(at::Tensor const&, std::optional, std::optional, std::optional, std::optional, bool, bool, std::optional) 2025-05-07T20:24:51.8681116Z U at::_ops::unsqueeze::call(at::Tensor const&, long) 2025-05-07T20:24:51.8681590Z U at::_ops::view::call(at::Tensor const&, c10::ArrayRef) 2025-05-07T20:24:51.8682304Z U at::_ops::zeros::call(c10::ArrayRef, std::optional, std::optional, std::optional, std::optional) 2025-05-07T20:24:51.8683001Z U at::tensor(c10::ArrayRef, c10::TensorOptions const&) 2025-05-07T20:24:51.8683364Z U at::TensorMaker::make_tensor() 2025-05-07T20:24:51.8683736Z U c10::AutogradMetaInterface::~AutogradMetaInterface() 2025-05-07T20:24:51.8684162Z U c10::BFloat16* at::TensorBase::data_ptr() const 2025-05-07T20:24:51.8684626Z U c10::BFloat16* at::TensorBase::mutable_data_ptr() const 2025-05-07T20:24:51.8685326Z U c10::BoolType::get() 2025-05-07T20:24:51.8685808Z U c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) 2025-05-07T20:24:51.8686286Z U c10::cuda::CUDACachingAllocator::allocator 2025-05-07T20:24:51.8686627Z U c10::cuda::CUDAStream::stream() const 2025-05-07T20:24:51.8687033Z U c10::cuda::current_device() 2025-05-07T20:24:51.8687331Z U c10::cuda::device_count() 2025-05-07T20:24:51.8687640Z U c10::cuda::ExchangeDevice(signed char) 2025-05-07T20:24:51.8687985Z U c10::cuda::getCurrentCUDAStream(signed char) 2025-05-07T20:24:51.8688349Z U c10::cuda::getDefaultCUDAStream(signed char) 2025-05-07T20:24:51.8688687Z U c10::cuda::GetDevice(signed char*) 2025-05-07T20:24:51.8689032Z U c10::cuda::getStreamFromPool(bool, signed char) 2025-05-07T20:24:51.8689404Z U c10::cuda::getStreamFromPool(int, signed char) 2025-05-07T20:24:51.8689757Z U c10::cuda::MaybeSetDevice(signed char) 2025-05-07T20:24:51.8690127Z U c10::cuda::setCurrentCUDAStream(c10::cuda::CUDAStream) 2025-05-07T20:24:51.8690484Z U c10::cuda::SetDevice(signed char) 2025-05-07T20:24:51.8690799Z U c10::cuda::warn_or_error_on_sync() 2025-05-07T20:24:51.8691482Z U c10::detail::infer_schema::make_function_schema(c10::ArrayRef, c10::ArrayRef) 2025-05-07T20:24:51.8692526Z U c10::detail::ListImpl::ListImpl(std::vector >, c10::Type::SingletonOrSharedTypePtr) 2025-05-07T20:24:51.8693263Z U c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) 2025-05-07T20:24:51.8694112Z U c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string, std::allocator > const&) 2025-05-07T20:24:51.8695003Z U c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) 2025-05-07T20:24:51.8695960Z U c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string, std::allocator > const&) 2025-05-07T20:24:51.8697034Z U c10d::getNcclErrorDetailStr(ncclResult_t, std::optional, std::allocator > >) 2025-05-07T20:24:51.8697733Z U c10d::ncclGetErrorWithVersion[abi:cxx11](ncclResult_t) 2025-05-07T20:24:51.8698355Z U c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string, std::allocator >) 2025-05-07T20:24:51.8698894Z U c10::Error::what() const 2025-05-07T20:24:51.8699310Z U c10::Float8_e4m3fn* at::TensorBase::mutable_data_ptr() const 2025-05-07T20:24:51.8699725Z U c10::FloatType::get() 2025-05-07T20:24:51.8700080Z U c10::GeneratorImpl::device() const 2025-05-07T20:24:51.8700386Z U c10::get_default_dtype() 2025-05-07T20:24:51.8700713Z U c10::impl::cow::is_cow_data_ptr(c10::DataPtr const&) 2025-05-07T20:24:51.8701126Z U c10::impl::cow::materialize_cow_storage(c10::StorageImpl&) 2025-05-07T20:24:51.8701501Z U c10::impl::device_guard_impl_registry 2025-05-07T20:24:51.8701904Z U c10::impl::ExcludeDispatchKeyGuard::~ExcludeDispatchKeyGuard() 2025-05-07T20:24:51.8702458Z U c10::impl::ExcludeDispatchKeyGuard::ExcludeDispatchKeyGuard(c10::DispatchKeySet) 2025-05-07T20:24:51.8702926Z U c10::impl::GPUTrace::gpuTraceState 2025-05-07T20:24:51.8703248Z U c10::impl::GPUTrace::haveState 2025-05-07T20:24:51.8703534Z U c10::IntType::get() 2025-05-07T20:24:51.8703831Z U c10::IValue::isTensorList() const 2025-05-07T20:24:51.8704171Z U c10::IValue::reportToTensorTypeError() const 2025-05-07T20:24:51.8704856Z U c10::ListType::get(std::__cxx11::basic_string, std::allocator > const&, c10::Type::SingletonOrSharedTypePtr) 2025-05-07T20:24:51.8705567Z U c10::MessageLogger::~MessageLogger() 2025-05-07T20:24:51.8705933Z U c10::MessageLogger::MessageLogger(char const*, int, int) 2025-05-07T20:24:51.8706313Z U c10::operator*(c10::SymInt const&, int) 2025-05-07T20:24:51.8706639Z U c10::operator-(c10::SymInt const&, int) 2025-05-07T20:24:51.8706971Z U c10::operator-(c10::SymInt const&, long) 2025-05-07T20:24:51.8707326Z U c10::operator<<(std::ostream&, c10::Device const&) 2025-05-07T20:24:51.8707690Z U c10::operator<<(std::ostream&, c10::DeviceType) 2025-05-07T20:24:51.8708129Z U c10::OptionalType::get(c10::Type::SingletonOrSharedTypePtr) 2025-05-07T20:24:51.8708542Z U c10::ScalarTypeType::get() 2025-05-07T20:24:51.8708895Z U c10::StorageImpl::throw_data_ptr_access_error() const 2025-05-07T20:24:51.8709246Z U c10::StringType::get() 2025-05-07T20:24:51.8709620Z U c10::SymbolicShapeMeta::init_is_contiguous() const 2025-05-07T20:24:51.8710005Z U c10::SymBool::guard_bool(char const*, long) const 2025-05-07T20:24:51.8710377Z U c10::SymFloat::guard_float(char const*, long) const 2025-05-07T20:24:51.8710750Z U c10::SymInt::guard_int(char const*, long) const 2025-05-07T20:24:51.8711401Z U c10::SymInt::SymInt(c10::intrusive_ptr >) 2025-05-07T20:24:51.8712075Z U c10::SymInt::toSymNode() const 2025-05-07T20:24:51.8712699Z U c10::TensorImpl::set_autograd_meta(std::unique_ptr >) 2025-05-07T20:24:51.8713354Z U c10::TensorImpl::throw_data_ptr_access_error() const 2025-05-07T20:24:51.8713703Z U c10::TensorType::get() 2025-05-07T20:24:51.8714007Z U c10::throwNullDataPtrError() 2025-05-07T20:24:51.8714324Z U c10::UndefinedTensorImpl::_singleton 2025-05-07T20:24:51.8714638Z U c10::warn(c10::Warning const&) 2025-05-07T20:24:51.8714936Z U c10::warnDeprecatedDataPtr() 2025-05-07T20:24:51.8715807Z U c10::Warning::Warning(std::variant, c10::SourceLocation const&, std::__cxx11::basic_string, std::allocator >, bool) 2025-05-07T20:24:51.8716751Z U caffe2::TypeMeta::error_unsupported_typemeta(caffe2::TypeMeta) 2025-05-07T20:24:51.8717149Z U caffe2::TypeMeta::typeMetaDatas() 2025-05-07T20:24:51.8717493Z U cublasLtCreate 2025-05-07T20:24:51.8717743Z U cublasLtMatmul 2025-05-07T20:24:51.8718024Z U cublasLtMatmulAlgoGetHeuristic 2025-05-07T20:24:51.8718332Z U cublasLtMatmulDescCreate 2025-05-07T20:24:51.8718642Z U cublasLtMatmulDescSetAttribute 2025-05-07T20:24:51.8718957Z U cublasLtMatmulPreferenceCreate 2025-05-07T20:24:51.8719292Z U cublasLtMatmulPreferenceSetAttribute 2025-05-07T20:24:51.8719622Z U cublasLtMatrixLayoutCreate 2025-05-07T20:24:51.8719942Z U cudaDeviceGetAttribute@libcudart.so.12 2025-05-07T20:24:51.8720290Z U cudaDeviceSynchronize@libcudart.so.12 2025-05-07T20:24:51.8720640Z U cudaEventCreateWithFlags@libcudart.so.12 2025-05-07T20:24:51.8720982Z U cudaEventDestroy@libcudart.so.12 2025-05-07T20:24:51.8721308Z U cudaEventElapsedTime@libcudart.so.12 2025-05-07T20:24:51.8721637Z U cudaEventQuery@libcudart.so.12 2025-05-07T20:24:51.8721956Z U cudaEventRecord@libcudart.so.12 2025-05-07T20:24:51.8722280Z U cudaEventSynchronize@libcudart.so.12 2025-05-07T20:24:51.8722684Z U cudaFree@libcudart.so.12 2025-05-07T20:24:51.8722990Z U cudaFuncSetAttribute@libcudart.so.12 2025-05-07T20:24:51.8723310Z U cudaGetDevice@libcudart.so.12 2025-05-07T20:24:51.8723645Z U cudaGetDeviceProperties_v2@libcudart.so.12 2025-05-07T20:24:51.8724008Z U cudaGetDriverEntryPoint@libcudart.so.12 2025-05-07T20:24:51.8724337Z U cudaGetErrorName@libcudart.so.12 2025-05-07T20:24:51.8724668Z U cudaGetErrorString@libcudart.so.12 2025-05-07T20:24:51.8724988Z U cudaGetLastError@libcudart.so.12 2025-05-07T20:24:51.8725310Z U cudaIpcGetMemHandle@libcudart.so.12 2025-05-07T20:24:51.8725646Z U cudaIpcOpenMemHandle@libcudart.so.12 2025-05-07T20:24:51.8725994Z U cudaLaunchCooperativeKernel@libcudart.so.12 2025-05-07T20:24:51.8726353Z U cudaLaunchKernelExC@libcudart.so.12 2025-05-07T20:24:51.8726672Z U cudaLaunchKernel@libcudart.so.12 2025-05-07T20:24:51.8727028Z U cudaMalloc@libcudart.so.12 2025-05-07T20:24:51.8727333Z U cudaMemcpyAsync@libcudart.so.12 2025-05-07T20:24:51.8727635Z U cudaMemcpy@libcudart.so.12 2025-05-07T20:24:51.8727942Z U cudaMemsetAsync@libcudart.so.12 2025-05-07T20:24:51.8728278Z U __cudaPopCallConfiguration@libcudart.so.12 2025-05-07T20:24:51.8728652Z U __cudaPushCallConfiguration@libcudart.so.12 2025-05-07T20:24:51.8729050Z U __cudaRegisterFatBinaryEnd@libcudart.so.12 2025-05-07T20:24:51.8729408Z U __cudaRegisterFatBinary@libcudart.so.12 2025-05-07T20:24:51.8729752Z U __cudaRegisterFunction@libcudart.so.12 2025-05-07T20:24:51.8730085Z U __cudaRegisterVar@libcudart.so.12 2025-05-07T20:24:51.8730401Z U cudaStreamQuery@libcudart.so.12 2025-05-07T20:24:51.8730728Z U cudaStreamSynchronize@libcudart.so.12 2025-05-07T20:24:51.8731068Z U cudaStreamWaitEvent@libcudart.so.12 2025-05-07T20:24:51.8731405Z U __cudaUnregisterFatBinary@libcudart.so.12 2025-05-07T20:24:51.8731746Z U __cxa_allocate_exception@CXXABI_1.3 2025-05-07T20:24:51.8732043Z U __cxa_atexit@GLIBC_2.2.5 2025-05-07T20:24:51.8732332Z U __cxa_begin_catch@CXXABI_1.3 2025-05-07T20:24:51.8732629Z U __cxa_end_catch@CXXABI_1.3 2025-05-07T20:24:51.8732923Z U __cxa_free_exception@CXXABI_1.3 2025-05-07T20:24:51.8733222Z U __cxa_guard_abort@CXXABI_1.3 2025-05-07T20:24:51.8733513Z U __cxa_guard_acquire@CXXABI_1.3 2025-05-07T20:24:51.8733860Z U __cxa_guard_release@CXXABI_1.3 2025-05-07T20:24:51.8734152Z U __cxa_rethrow@CXXABI_1.3 2025-05-07T20:24:51.8734446Z U __cxa_thread_atexit@CXXABI_1.3.7 2025-05-07T20:24:51.8734737Z U __cxa_throw@CXXABI_1.3 2025-05-07T20:24:51.8735018Z U dlclose@GLIBC_2.2.5 2025-05-07T20:24:51.8735290Z U dlopen@GLIBC_2.2.5 2025-05-07T20:24:51.8735551Z U dlsym@GLIBC_2.2.5 2025-05-07T20:24:51.8735812Z U exit@GLIBC_2.2.5 2025-05-07T20:24:51.8736066Z U fclose@GLIBC_2.2.5 2025-05-07T20:24:51.8736332Z U fflush@GLIBC_2.2.5 2025-05-07T20:24:51.8736642Z U float* at::TensorBase::data_ptr() const 2025-05-07T20:24:51.8737037Z U float* at::TensorBase::mutable_data_ptr() const 2025-05-07T20:24:51.8737373Z U fopen@GLIBC_2.2.5 2025-05-07T20:24:51.8737642Z U fprintf@GLIBC_2.2.5 2025-05-07T20:24:51.8737914Z U fread@GLIBC_2.2.5 2025-05-07T20:24:51.8738170Z U fwrite@GLIBC_2.2.5 2025-05-07T20:24:51.8738435Z U getenv@GLIBC_2.2.5 2025-05-07T20:24:51.8738755Z U __gxx_personality_v0@CXXABI_1.3 2025-05-07T20:24:51.8739091Z U int* at::TensorBase::data_ptr() const 2025-05-07T20:24:51.8739444Z U int* at::TensorBase::mutable_data_ptr() const 2025-05-07T20:24:51.8739810Z U long* at::TensorBase::data_ptr() const 2025-05-07T20:24:51.8740205Z U long c10::detail::maybe_wrap_dim_slow(long, long, bool) 2025-05-07T20:24:51.8740559Z U memcmp@GLIBC_2.2.5 2025-05-07T20:24:51.8740823Z U memcpy@GLIBC_2.14 2025-05-07T20:24:51.8741086Z U memmove@GLIBC_2.2.5 2025-05-07T20:24:51.8741356Z U memset@GLIBC_2.2.5 2025-05-07T20:24:51.8741611Z U ncclAllGather 2025-05-07T20:24:51.8741863Z U ncclAllReduce 2025-05-07T20:24:51.8742104Z U ncclCommInitRank 2025-05-07T20:24:51.8742361Z U ncclGetUniqueId 2025-05-07T20:24:51.8742618Z U ncclReduceScatter 2025-05-07T20:24:51.8742932Z U operator delete(void*, unsigned long)@CXXABI_1.3.9 2025-05-07T20:24:51.8743344Z U operator new(unsigned long)@GLIBCXX_3.4 2025-05-07T20:24:51.8743648Z U printf@GLIBC_2.2.5 2025-05-07T20:24:51.8743919Z U sched_yield@GLIBC_2.2.5 2025-05-07T20:24:51.8744267Z U signed char* at::TensorBase::data_ptr() const 2025-05-07T20:24:51.8744721Z U signed char* at::TensorBase::mutable_data_ptr() const 2025-05-07T20:24:51.8745151Z U __stack_chk_fail@GLIBC_2.4 2025-05-07T20:24:51.8745569Z U std::basic_ios >::clear(std::_Ios_Iostate)@GLIBCXX_3.4 2025-05-07T20:24:51.8746228Z U std::basic_ios >::init(std::basic_streambuf >*)@GLIBCXX_3.4 2025-05-07T20:24:51.8746877Z U std::basic_iostream >::~basic_iostream()@GLIBCXX_3.4 2025-05-07T20:24:51.8747456Z U std::basic_iostream >::~basic_iostream()@GLIBCXX_3.4 2025-05-07T20:24:51.8748167Z U std::basic_ios >::init(std::basic_streambuf >*)@GLIBCXX_3.4 2025-05-07T20:24:51.8749103Z U std::basic_ostream >& std::operator<< >(std::basic_ostream >&, char const*)@GLIBCXX_3.4 2025-05-07T20:24:51.8750215Z U std::basic_ostream >& std::__ostream_insert >(std::basic_ostream >&, char const*, long)@GLIBCXX_3.4.9 2025-05-07T20:24:51.8751305Z U std::basic_streambuf >::basic_streambuf(std::basic_streambuf > const&)@GLIBCXX_3.4 2025-05-07T20:24:51.8752350Z U std::basic_streambuf >::basic_streambuf(std::basic_streambuf > const&)@GLIBCXX_3.4 2025-05-07T20:24:51.8752992Z U std::cerr@GLIBCXX_3.4 2025-05-07T20:24:51.8753272Z U std::cout@GLIBCXX_3.4 2025-05-07T20:24:51.8753598Z U std::ctype::_M_widen_init() const@GLIBCXX_3.4.11 2025-05-07T20:24:51.8754175Z U std::__cxx11::basic_ostringstream, std::allocator >::basic_ostringstream() 2025-05-07T20:24:51.8754960Z U std::__cxx11::basic_ostringstream, std::allocator >::~basic_ostringstream()@GLIBCXX_3.4.21 2025-05-07T20:24:51.8755894Z U std::__cxx11::basic_stringbuf, std::allocator >::_M_pbump(char*, char*, long)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8756768Z U std::__cxx11::basic_stringbuf, std::allocator >::_M_sync(char*, unsigned long, unsigned long)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8757694Z U std::__cxx11::basic_stringbuf, std::allocator >::__xfer_bufptrs::~__xfer_bufptrs()@GLIBCXX_3.4.21 2025-05-07T20:24:51.8758568Z U std::__cxx11::basic_stringbuf, std::allocator >::_M_pbump(wchar_t*, wchar_t*, long)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8759508Z U std::__cxx11::basic_stringbuf, std::allocator >::_M_sync(wchar_t*, unsigned long, unsigned long)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8760445Z U std::__cxx11::basic_stringbuf, std::allocator >::__xfer_bufptrs::~__xfer_bufptrs()@GLIBCXX_3.4.21 2025-05-07T20:24:51.8761289Z U std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const@GLIBCXX_3.4.18 2025-05-07T20:24:51.8761832Z U stderr@GLIBC_2.2.5 2025-05-07T20:24:51.8762178Z U std::exception::~exception()@GLIBCXX_3.4 2025-05-07T20:24:51.8762526Z U std::exception::what() const@GLIBCXX_3.4 2025-05-07T20:24:51.8762864Z U std::ios_base::Init::~Init()@GLIBCXX_3.4 2025-05-07T20:24:51.8763190Z U std::ios_base::Init::Init()@GLIBCXX_3.4 2025-05-07T20:24:51.8763517Z U std::ios_base::~ios_base()@GLIBCXX_3.4 2025-05-07T20:24:51.8763834Z U std::ios_base::ios_base()@GLIBCXX_3.4 2025-05-07T20:24:51.8764201Z U std::locale::~locale()@GLIBCXX_3.4 2025-05-07T20:24:51.8764514Z U std::locale::locale()@GLIBCXX_3.4 2025-05-07T20:24:51.8764882Z U std::logic_error::logic_error(char const*)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8765272Z U std::logic_error::~logic_error()@GLIBCXX_3.4 2025-05-07T20:24:51.8765600Z U std::ostream::flush()@GLIBCXX_3.4 2025-05-07T20:24:51.8765929Z U std::ostream::operator<<(int)@GLIBCXX_3.4 2025-05-07T20:24:51.8766256Z U std::ostream::put(char)@GLIBCXX_3.4 2025-05-07T20:24:51.8766640Z U std::ostream& std::ostream::_M_insert(long)@GLIBCXX_3.4.9 2025-05-07T20:24:51.8767138Z U std::ostream& std::ostream::_M_insert(unsigned long)@GLIBCXX_3.4.9 2025-05-07T20:24:51.8767678Z U std::ostream& std::ostream::_M_insert(void const*)@GLIBCXX_3.4.9 2025-05-07T20:24:51.8768167Z U std::runtime_error::runtime_error(char const*)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8768569Z U std::runtime_error::~runtime_error()@GLIBCXX_3.4 2025-05-07T20:24:51.8769243Z U std::runtime_error::runtime_error(std::__cxx11::basic_string, std::allocator > const&)@GLIBCXX_3.4.21 2025-05-07T20:24:51.8769848Z U std::terminate()@GLIBCXX_3.4 2025-05-07T20:24:51.8770155Z U std::__throw_bad_alloc()@GLIBCXX_3.4 2025-05-07T20:24:51.8770475Z U std::__throw_bad_array_new_length() 2025-05-07T20:24:51.8770792Z U std::__throw_bad_cast()@GLIBCXX_3.4 2025-05-07T20:24:51.8771127Z U std::__throw_length_error(char const*)@GLIBCXX_3.4 2025-05-07T20:24:51.8771497Z U std::__throw_logic_error(char const*)@GLIBCXX_3.4 2025-05-07T20:24:51.8771891Z U std::__throw_out_of_range_fmt(char const*, ...)@GLIBCXX_3.4.20 2025-05-07T20:24:51.8772288Z U std::__throw_system_error(int)@GLIBCXX_3.4.11 2025-05-07T20:24:51.8772605Z U strlen@GLIBC_2.2.5 2025-05-07T20:24:51.8772874Z U __tls_get_addr@GLIBC_2.3 2025-05-07T20:24:51.8773184Z U torch::CppFunction::~CppFunction() 2025-05-07T20:24:51.8773701Z U torch::cuda::nccl::all2all_single_equal_split(at::Tensor&, at::Tensor&, int, void*, c10::cuda::CUDAStream&) 2025-05-07T20:24:51.8774610Z U torch::cuda::nccl::all2all(std::vector >&, std::vector >&, void*, c10::cuda::CUDAStream&) 2025-05-07T20:24:51.8775520Z U torch::jit::parseSchema(std::__cxx11::basic_string, std::allocator > const&, bool) 2025-05-07T20:24:51.8776381Z U torch::Library::_def(c10::FunctionSchema&&, c10::OperatorName*, std::vector > const&, torch::_RegisterOrVerify) & 2025-05-07T20:24:51.8777154Z U torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & 2025-05-07T20:24:51.8778048Z U torch::Library::Library(torch::Library::Kind, std::__cxx11::basic_string, std::allocator >, std::optional, char const*, unsigned int) 2025-05-07T20:24:51.8778785Z U typeinfo for c10::Error 2025-05-07T20:24:51.8779103Z U typeinfo for std::exception@GLIBCXX_3.4 2025-05-07T20:24:51.8779482Z U typeinfo for std::logic_error@GLIBCXX_3.4 2025-05-07T20:24:51.8779833Z U typeinfo for std::runtime_error@GLIBCXX_3.4 2025-05-07T20:24:51.8780147Z U __udivti3@GCC_3.0 2025-05-07T20:24:51.8780522Z U unsigned char* at::TensorBase::mutable_data_ptr() const 2025-05-07T20:24:51.8780924Z U _Unwind_Resume@GCC_3.0 2025-05-07T20:24:51.8781239Z U usleep@GLIBC_2.2.5 2025-05-07T20:24:51.8781515Z U vtable for c10::Error 2025-05-07T20:24:51.8781841Z U vtable for __cxxabiv1::__class_type_info@CXXABI_1.3 2025-05-07T20:24:51.8782246Z U vtable for __cxxabiv1::__function_type_info@CXXABI_1.3 2025-05-07T20:24:51.8782654Z U vtable for __cxxabiv1::__si_class_type_info@CXXABI_1.3 2025-05-07T20:24:51.8783094Z U vtable for std::basic_ios >@GLIBCXX_3.4 2025-05-07T20:24:51.8783605Z U vtable for std::basic_ios >@GLIBCXX_3.4 2025-05-07T20:24:51.8784123Z U vtable for std::basic_streambuf >@GLIBCXX_3.4 2025-05-07T20:24:51.8784672Z U vtable for std::basic_streambuf >@GLIBCXX_3.4 2025-05-07T20:24:51.8785804Z U vtable for std::__cxx11::basic_istringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8786589Z U vtable for std::__cxx11::basic_istringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8787480Z U vtable for std::__cxx11::basic_ostringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8788262Z U vtable for std::__cxx11::basic_ostringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8789023Z U vtable for std::__cxx11::basic_stringbuf, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8789771Z U vtable for std::__cxx11::basic_stringbuf, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8790526Z U vtable for std::__cxx11::basic_stringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8791289Z U vtable for std::__cxx11::basic_stringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8791941Z U vtable for torch::autograd::AutogradMeta 2025-05-07T20:24:51.8792469Z U VTT for std::__cxx11::basic_istringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8793279Z U VTT for std::__cxx11::basic_istringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8794026Z U VTT for std::__cxx11::basic_ostringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8794781Z U VTT for std::__cxx11::basic_ostringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8795560Z U VTT for std::__cxx11::basic_stringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8796301Z U VTT for std::__cxx11::basic_stringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.8796821Z U wmemcpy@GLIBC_2.2.5 2025-05-07T20:24:51.8797097Z U wmemmove@GLIBC_2.2.5 2025-05-07T20:24:51.8797362Z U __xstat@GLIBC_2.2.5 2025-05-07T20:24:51.8797643Z w __cxa_finalize@GLIBC_2.2.5 2025-05-07T20:24:51.8797916Z w __gmon_start__ 2025-05-07T20:24:51.8798248Z w _ITM_deregisterTMCloneTable 2025-05-07T20:24:51.8798549Z w _ITM_registerTMCloneTable 2025-05-07T20:24:51.8798824Z w __pthread_key_create 2025-05-07T20:24:51.8799110Z w pthread_mutex_lock@GLIBC_2.2.5 2025-05-07T20:24:51.8799415Z w pthread_mutex_unlock@GLIBC_2.2.5 2025-05-07T20:24:51.8799767Z [CHECK] Listing out external shared libraries linked: 2025-05-07T20:24:51.8800355Z + ldd ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.8800761Z 2025-05-07T20:24:51.8800885Z linux-vdso.so.1 (0x00007fff38160000) 2025-05-07T20:24:51.8801164Z libtorch.so => not found 2025-05-07T20:24:51.8801392Z libc10.so => not found 2025-05-07T20:24:51.8801621Z libnvrtc.so.12 => not found 2025-05-07T20:24:51.8801860Z libc10_cuda.so => not found 2025-05-07T20:24:51.8802104Z libnccl.so.2 => not found 2025-05-07T20:24:51.8802411Z libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f5736200000) 2025-05-07T20:24:51.8802844Z libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f5735000000) 2025-05-07T20:24:51.8803224Z libtorch_cpu.so => not found 2025-05-07T20:24:51.8803470Z libtorch_cuda.so => not found 2025-05-07T20:24:51.8803718Z libcudart.so.12 => not found 2025-05-07T20:24:51.8804008Z libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f5734dfc000) 2025-05-07T20:24:51.8804409Z libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f5734a67000) 2025-05-07T20:24:51.8804822Z libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f573484f000) 2025-05-07T20:24:51.8805202Z libc.so.6 => /usr/lib64/libc.so.6 (0x00007f5734479000) 2025-05-07T20:24:51.8805591Z /lib64/ld-linux-x86-64.so.2 (0x00007f57404ae000) 2025-05-07T20:24:51.8805930Z libm.so.6 => /usr/lib64/libm.so.6 (0x00007f57340f7000) 2025-05-07T20:24:51.8806332Z libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f5733ed7000) 2025-05-07T20:24:51.8806738Z librt.so.1 => /usr/lib64/librt.so.1 (0x00007f5733ccf000) 2025-05-07T20:24:51.8806982Z 2025-05-07T20:24:51.8807084Z [CHECK] Displaying ELF information: 2025-05-07T20:24:51.8807594Z + readelf -d ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:24:51.8808023Z 2025-05-07T20:24:51.8808027Z 2025-05-07T20:24:51.8808179Z Dynamic section at offset 0x5a6fa90 contains 39 entries: 2025-05-07T20:24:51.8808541Z Tag Type Name/Value 2025-05-07T20:24:51.8808928Z 0x0000000000000001 (NEEDED) Shared library: [libtorch.so] 2025-05-07T20:24:51.8809396Z 0x0000000000000001 (NEEDED) Shared library: [libc10.so] 2025-05-07T20:24:51.8809856Z 0x0000000000000001 (NEEDED) Shared library: [libnvrtc.so.12] 2025-05-07T20:24:51.8810332Z 0x0000000000000001 (NEEDED) Shared library: [libc10_cuda.so] 2025-05-07T20:24:51.8810800Z 0x0000000000000001 (NEEDED) Shared library: [libnccl.so.2] 2025-05-07T20:24:51.8811305Z 0x0000000000000001 (NEEDED) Shared library: [libcuda.so.1] 2025-05-07T20:24:51.8811786Z 0x0000000000000001 (NEEDED) Shared library: [libnvidia-ml.so.1] 2025-05-07T20:24:51.8812264Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cpu.so] 2025-05-07T20:24:51.8812746Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cuda.so] 2025-05-07T20:24:51.8813217Z 0x0000000000000001 (NEEDED) Shared library: [libcudart.so.12] 2025-05-07T20:24:51.8813688Z 0x0000000000000001 (NEEDED) Shared library: [libdl.so.2] 2025-05-07T20:24:51.8814152Z 0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6] 2025-05-07T20:24:51.8814618Z 0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1] 2025-05-07T20:24:51.8815077Z 0x0000000000000001 (NEEDED) Shared library: [libc.so.6] 2025-05-07T20:24:51.8815544Z 0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2] 2025-05-07T20:24:51.8816125Z 0x000000000000000e (SONAME) Library soname: [fbgemm_gpu_experimental_gen_ai.so] 2025-05-07T20:24:51.8816562Z 0x000000000000000c (INIT) 0x5c000 2025-05-07T20:24:51.8816869Z 0x000000000000000d (FINI) 0x4a74b4 2025-05-07T20:24:51.8817184Z 0x0000000000000019 (INIT_ARRAY) 0x5a6f148 2025-05-07T20:24:51.8817510Z 0x000000000000001b (INIT_ARRAYSZ) 1136 (bytes) 2025-05-07T20:24:51.8817840Z 0x000000000000001a (FINI_ARRAY) 0x5a6f5b8 2025-05-07T20:24:51.8818192Z 0x000000000000001c (FINI_ARRAYSZ) 8 (bytes) 2025-05-07T20:24:51.8818517Z 0x000000006ffffef5 (GNU_HASH) 0x298 2025-05-07T20:24:51.8818814Z 0x0000000000000005 (STRTAB) 0xec18 2025-05-07T20:24:51.8819115Z 0x0000000000000006 (SYMTAB) 0x34e8 2025-05-07T20:24:51.8819439Z 0x000000000000000a (STRSZ) 237184 (bytes) 2025-05-07T20:24:51.8819762Z 0x000000000000000b (SYMENT) 24 (bytes) 2025-05-07T20:24:51.8820080Z 0x0000000000000003 (PLTGOT) 0x5a72000 2025-05-07T20:24:51.8820408Z 0x0000000000000002 (PLTRELSZ) 20184 (bytes) 2025-05-07T20:24:51.8820728Z 0x0000000000000014 (PLTREL) RELA 2025-05-07T20:24:51.8821020Z 0x0000000000000017 (JMPREL) 0x56938 2025-05-07T20:24:51.8821324Z 0x0000000000000007 (RELA) 0x49b40 2025-05-07T20:24:51.8821639Z 0x0000000000000008 (RELASZ) 52728 (bytes) 2025-05-07T20:24:51.8821965Z 0x0000000000000009 (RELAENT) 24 (bytes) 2025-05-07T20:24:51.8822284Z 0x000000006ffffffe (VERNEED) 0x499e0 2025-05-07T20:24:51.8822587Z 0x000000006fffffff (VERNEEDNUM) 6 2025-05-07T20:24:51.8822886Z 0x000000006ffffff0 (VERSYM) 0x48a98 2025-05-07T20:24:51.8823226Z 0x000000006ffffff9 (RELACOUNT) 735 2025-05-07T20:24:51.8823514Z 0x0000000000000000 (NULL) 0x0 2025-05-07T20:24:51.8823700Z 2025-05-07T20:24:51.8823805Z ################################################################################ 2025-05-07T20:24:51.8824026Z 2025-05-07T20:24:51.8824030Z 2025-05-07T20:24:51.8824134Z ################################################################################ 2025-05-07T20:24:51.8824739Z [CHECK] BUILT LIBRARY: ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.8825371Z [CHECK] Listing out library size: 2025-05-07T20:24:51.8825930Z + du -h --block-size=1M ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.8826400Z 2025-05-07T20:24:51.8826715Z 1 ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.8827137Z 2025-05-07T20:24:51.8827628Z [CHECK] Listing out the GLIBC versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.8828812Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so | grep GLIBC_ | sed 's/.*GLIBC_\([.0-9]*\).*/GLIBC_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.8829536Z 2025-05-07T20:24:51.8856492Z GLIBC_2.2.5 2025-05-07T20:24:51.8856698Z GLIBC_2.14 2025-05-07T20:24:51.8859812Z 2025-05-07T20:24:51.8859828Z 2025-05-07T20:24:51.8860360Z [CHECK] Listing out the GLIBCXX versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.8861560Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so | grep GLIBCXX_ | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.8862259Z 2025-05-07T20:24:51.8904923Z GLIBCXX_3.4 2025-05-07T20:24:51.8905155Z GLIBCXX_3.4.9 2025-05-07T20:24:51.8905344Z GLIBCXX_3.4.21 2025-05-07T20:24:51.8908425Z 2025-05-07T20:24:51.8908574Z 2025-05-07T20:24:51.8940600Z + nm -gDC ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so > /tmp/tmp.0DR2hR28PE.symbols.txt 2025-05-07T20:24:51.8941348Z 2025-05-07T20:24:51.8972823Z 2025-05-07T20:24:51.9014818Z [CHECK] Total Number of symbols: 153 2025-05-07T20:24:51.9036538Z [CHECK] Number of fbgemm symbols: 19 2025-05-07T20:24:51.9067151Z + nm -gDCu ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so > /tmp/tmp.0hYAsiADlW.usymbols.txt 2025-05-07T20:24:51.9067749Z 2025-05-07T20:24:51.9096181Z 2025-05-07T20:24:51.9136439Z [CHECK] Listing out undefined symbols (76 total): 2025-05-07T20:24:51.9167571Z U at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) 2025-05-07T20:24:51.9168336Z U at::_ops::to_dtype::call(at::Tensor const&, c10::ScalarType, bool, bool, std::optional) 2025-05-07T20:24:51.9169169Z U at::_ops::zeros::call(c10::ArrayRef, std::optional, std::optional, std::optional, std::optional) 2025-05-07T20:24:51.9178210Z U c10::detail::infer_schema::make_function_schema(c10::ArrayRef, c10::ArrayRef) 2025-05-07T20:24:51.9179168Z U c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) 2025-05-07T20:24:51.9179835Z U c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) 2025-05-07T20:24:51.9180359Z U c10::FloatType::get() 2025-05-07T20:24:51.9180690Z U c10::IValue::reportToTensorTypeError() const 2025-05-07T20:24:51.9181052Z U c10::MessageLogger::~MessageLogger() 2025-05-07T20:24:51.9181555Z U c10::MessageLogger::MessageLogger(char const*, int, int) 2025-05-07T20:24:51.9181973Z U c10::SymFloat::guard_float(char const*, long) const 2025-05-07T20:24:51.9182318Z U c10::TensorType::get() 2025-05-07T20:24:51.9182627Z U c10::UndefinedTensorImpl::_singleton 2025-05-07T20:24:51.9183030Z U caffe2::TypeMeta::error_unsupported_typemeta(caffe2::TypeMeta) 2025-05-07T20:24:51.9183428Z U cudaGetErrorString@libcudart.so.12 2025-05-07T20:24:51.9183759Z U cudaGetLastError@libcudart.so.12 2025-05-07T20:24:51.9184076Z U cudaLaunchKernel@libcudart.so.12 2025-05-07T20:24:51.9184425Z U __cudaPopCallConfiguration@libcudart.so.12 2025-05-07T20:24:51.9185147Z U __cudaPushCallConfiguration@libcudart.so.12 2025-05-07T20:24:51.9185527Z U __cudaRegisterFatBinaryEnd@libcudart.so.12 2025-05-07T20:24:51.9185891Z U __cudaRegisterFatBinary@libcudart.so.12 2025-05-07T20:24:51.9186229Z U __cudaRegisterFunction@libcudart.so.12 2025-05-07T20:24:51.9186559Z U __cudaRegisterVar@libcudart.so.12 2025-05-07T20:24:51.9186989Z U __cudaUnregisterFatBinary@libcudart.so.12 2025-05-07T20:24:51.9187327Z U __cxa_allocate_exception@CXXABI_1.3 2025-05-07T20:24:51.9187620Z U __cxa_atexit@GLIBC_2.2.5 2025-05-07T20:24:51.9187906Z U __cxa_free_exception@CXXABI_1.3 2025-05-07T20:24:51.9188193Z U __cxa_throw@CXXABI_1.3 2025-05-07T20:24:51.9188513Z U float* at::TensorBase::data_ptr() const 2025-05-07T20:24:51.9188856Z U __gxx_personality_v0@CXXABI_1.3 2025-05-07T20:24:51.9189231Z U long c10::detail::maybe_wrap_dim_slow(long, long, bool) 2025-05-07T20:24:51.9189595Z U memcpy@GLIBC_2.14 2025-05-07T20:24:51.9189861Z U memmove@GLIBC_2.2.5 2025-05-07T20:24:51.9190140Z U memset@GLIBC_2.2.5 2025-05-07T20:24:51.9190408Z U ncclCommDestroy 2025-05-07T20:24:51.9190656Z U ncclCommInitAll 2025-05-07T20:24:51.9190977Z U operator delete(void*, unsigned long)@CXXABI_1.3.9 2025-05-07T20:24:51.9191413Z U operator new(unsigned long)@GLIBCXX_3.4 2025-05-07T20:24:51.9191959Z U std::basic_ios >::clear(std::_Ios_Iostate)@GLIBCXX_3.4 2025-05-07T20:24:51.9192603Z U std::basic_ios >::init(std::basic_streambuf >*)@GLIBCXX_3.4 2025-05-07T20:24:51.9193619Z U std::basic_ostream >& std::__ostream_insert >(std::basic_ostream >&, char const*, long)@GLIBCXX_3.4.9 2025-05-07T20:24:51.9194617Z U std::__cxx11::basic_ostringstream, std::allocator >::~basic_ostringstream()@GLIBCXX_3.4.21 2025-05-07T20:24:51.9195187Z U std::ios_base::Init::~Init()@GLIBCXX_3.4 2025-05-07T20:24:51.9195525Z U std::ios_base::Init::Init()@GLIBCXX_3.4 2025-05-07T20:24:51.9195860Z U std::ios_base::~ios_base()@GLIBCXX_3.4 2025-05-07T20:24:51.9196176Z U std::ios_base::ios_base()@GLIBCXX_3.4 2025-05-07T20:24:51.9196493Z U std::locale::~locale()@GLIBCXX_3.4 2025-05-07T20:24:51.9196800Z U std::locale::locale()@GLIBCXX_3.4 2025-05-07T20:24:51.9197126Z U std::ostream::operator<<(int)@GLIBCXX_3.4 2025-05-07T20:24:51.9197514Z U std::ostream& std::ostream::_M_insert(long)@GLIBCXX_3.4.9 2025-05-07T20:24:51.9197928Z U std::runtime_error::~runtime_error()@GLIBCXX_3.4 2025-05-07T20:24:51.9198613Z U std::runtime_error::runtime_error(std::__cxx11::basic_string, std::allocator > const&)@GLIBCXX_3.4.21 2025-05-07T20:24:51.9199209Z U std::__throw_bad_alloc()@GLIBCXX_3.4 2025-05-07T20:24:51.9199557Z U std::__throw_length_error(char const*)@GLIBCXX_3.4 2025-05-07T20:24:51.9199939Z U std::__throw_logic_error(char const*)@GLIBCXX_3.4 2025-05-07T20:24:51.9200255Z U strlen@GLIBC_2.2.5 2025-05-07T20:24:51.9200553Z U torch::CppFunction::~CppFunction() 2025-05-07T20:24:51.9201100Z U torch::jit::parseSchema(std::__cxx11::basic_string, std::allocator > const&, bool) 2025-05-07T20:24:51.9201957Z U torch::Library::_def(c10::FunctionSchema&&, c10::OperatorName*, std::vector > const&, torch::_RegisterOrVerify) & 2025-05-07T20:24:51.9202716Z U torch::Library::_impl(char const*, torch::CppFunction&&, torch::_RegisterOrVerify) & 2025-05-07T20:24:51.9203601Z U torch::Library::Library(torch::Library::Kind, std::__cxx11::basic_string, std::allocator >, std::optional, char const*, unsigned int) 2025-05-07T20:24:51.9204394Z U typeinfo for std::runtime_error@GLIBCXX_3.4 2025-05-07T20:24:51.9204716Z U _Unwind_Resume@GCC_3.0 2025-05-07T20:24:51.9205041Z U vtable for __cxxabiv1::__class_type_info@CXXABI_1.3 2025-05-07T20:24:51.9205433Z U vtable for __cxxabiv1::__function_type_info@CXXABI_1.3 2025-05-07T20:24:51.9205831Z U vtable for __cxxabiv1::__si_class_type_info@CXXABI_1.3 2025-05-07T20:24:51.9206260Z U vtable for std::basic_ios >@GLIBCXX_3.4 2025-05-07T20:24:51.9206764Z U vtable for std::basic_streambuf >@GLIBCXX_3.4 2025-05-07T20:24:51.9207399Z U vtable for std::__cxx11::basic_ostringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.9208110Z U vtable for std::__cxx11::basic_stringbuf, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.9208817Z U VTT for std::__cxx11::basic_ostringstream, std::allocator >@GLIBCXX_3.4.21 2025-05-07T20:24:51.9209351Z w __cxa_finalize@GLIBC_2.2.5 2025-05-07T20:24:51.9209630Z w __gmon_start__ 2025-05-07T20:24:51.9209898Z w _ITM_deregisterTMCloneTable 2025-05-07T20:24:51.9210190Z w _ITM_registerTMCloneTable 2025-05-07T20:24:51.9210468Z w __pthread_key_create 2025-05-07T20:24:51.9210793Z [CHECK] Listing out external shared libraries linked: 2025-05-07T20:24:51.9211399Z + ldd ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.9211824Z 2025-05-07T20:24:51.9246362Z linux-vdso.so.1 (0x00007fff9b3a3000) 2025-05-07T20:24:51.9246661Z libtorch.so => not found 2025-05-07T20:24:51.9246937Z libc10.so => not found 2025-05-07T20:24:51.9247240Z libnvrtc.so.12 => not found 2025-05-07T20:24:51.9247494Z libc10_cuda.so => not found 2025-05-07T20:24:51.9247731Z libnccl.so.2 => not found 2025-05-07T20:24:51.9248048Z libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f83c1000000) 2025-05-07T20:24:51.9248488Z libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f83bfe00000) 2025-05-07T20:24:51.9248866Z libtorch_cpu.so => not found 2025-05-07T20:24:51.9249125Z libtorch_cuda.so => not found 2025-05-07T20:24:51.9249371Z libcudart.so.12 => not found 2025-05-07T20:24:51.9249701Z libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f83bfa6b000) 2025-05-07T20:24:51.9250110Z libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f83bf853000) 2025-05-07T20:24:51.9250488Z libc.so.6 => /usr/lib64/libc.so.6 (0x00007f83bf47d000) 2025-05-07T20:24:51.9250830Z libm.so.6 => /usr/lib64/libm.so.6 (0x00007f83bf0fb000) 2025-05-07T20:24:51.9251267Z libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f83beef7000) 2025-05-07T20:24:51.9251674Z libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f83becd7000) 2025-05-07T20:24:51.9252248Z librt.so.1 => /usr/lib64/librt.so.1 (0x00007f83beacf000) 2025-05-07T20:24:51.9252661Z /lib64/ld-linux-x86-64.so.2 (0x00007f83c550c000) 2025-05-07T20:24:51.9252887Z 2025-05-07T20:24:51.9253235Z [CHECK] Displaying ELF information: 2025-05-07T20:24:51.9253800Z + readelf -d ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:24:51.9254252Z 2025-05-07T20:24:51.9270391Z 2025-05-07T20:24:51.9270877Z Dynamic section at offset 0x72c20 contains 37 entries: 2025-05-07T20:24:51.9271242Z Tag Type Name/Value 2025-05-07T20:24:51.9271666Z 0x0000000000000001 (NEEDED) Shared library: [libtorch.so] 2025-05-07T20:24:51.9272303Z 0x0000000000000001 (NEEDED) Shared library: [libc10.so] 2025-05-07T20:24:51.9272771Z 0x0000000000000001 (NEEDED) Shared library: [libnvrtc.so.12] 2025-05-07T20:24:51.9273256Z 0x0000000000000001 (NEEDED) Shared library: [libc10_cuda.so] 2025-05-07T20:24:51.9274855Z 0x0000000000000001 (NEEDED) Shared library: [libnccl.so.2] 2025-05-07T20:24:51.9275361Z 0x0000000000000001 (NEEDED) Shared library: [libcuda.so.1] 2025-05-07T20:24:51.9275859Z 0x0000000000000001 (NEEDED) Shared library: [libnvidia-ml.so.1] 2025-05-07T20:24:51.9276359Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cpu.so] 2025-05-07T20:24:51.9276843Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cuda.so] 2025-05-07T20:24:51.9277314Z 0x0000000000000001 (NEEDED) Shared library: [libcudart.so.12] 2025-05-07T20:24:51.9277794Z 0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6] 2025-05-07T20:24:51.9278258Z 0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1] 2025-05-07T20:24:51.9278733Z 0x0000000000000001 (NEEDED) Shared library: [libc.so.6] 2025-05-07T20:24:51.9279267Z 0x000000000000000e (SONAME) Library soname: [fbgemm_gpu_experimental_example_py.so] 2025-05-07T20:24:51.9279728Z 0x000000000000000c (INIT) 0x5000 2025-05-07T20:24:51.9280185Z 0x000000000000000d (FINI) 0xab04 2025-05-07T20:24:51.9280483Z 0x0000000000000019 (INIT_ARRAY) 0x73b28 2025-05-07T20:24:51.9280802Z 0x000000000000001b (INIT_ARRAYSZ) 32 (bytes) 2025-05-07T20:24:51.9281116Z 0x000000000000001a (FINI_ARRAY) 0x73b48 2025-05-07T20:24:51.9281438Z 0x000000000000001c (FINI_ARRAYSZ) 8 (bytes) 2025-05-07T20:24:51.9281749Z 0x000000006ffffef5 (GNU_HASH) 0x260 2025-05-07T20:24:51.9282124Z 0x0000000000000005 (STRTAB) 0x1360 2025-05-07T20:24:51.9282431Z 0x0000000000000006 (SYMTAB) 0x4f0 2025-05-07T20:24:51.9282741Z 0x000000000000000a (STRSZ) 10416 (bytes) 2025-05-07T20:24:51.9283080Z 0x000000000000000b (SYMENT) 24 (bytes) 2025-05-07T20:24:51.9283386Z 0x0000000000000003 (PLTGOT) 0x74000 2025-05-07T20:24:51.9283713Z 0x0000000000000002 (PLTRELSZ) 1896 (bytes) 2025-05-07T20:24:51.9284028Z 0x0000000000000014 (PLTREL) RELA 2025-05-07T20:24:51.9284336Z 0x0000000000000017 (JMPREL) 0x44c0 2025-05-07T20:24:51.9293250Z 0x0000000000000007 (RELA) 0x3e18 2025-05-07T20:24:51.9293588Z 0x0000000000000008 (RELASZ) 1704 (bytes) 2025-05-07T20:24:51.9293913Z 0x0000000000000009 (RELAENT) 24 (bytes) 2025-05-07T20:24:51.9294222Z 0x000000006ffffffe (VERNEED) 0x3d48 2025-05-07T20:24:51.9294529Z 0x000000006fffffff (VERNEEDNUM) 4 2025-05-07T20:24:51.9294834Z 0x000000006ffffff0 (VERSYM) 0x3c10 2025-05-07T20:24:51.9295149Z 0x000000006ffffff9 (RELACOUNT) 7 2025-05-07T20:24:51.9295442Z 0x0000000000000000 (NULL) 0x0 2025-05-07T20:24:51.9295668Z 2025-05-07T20:24:51.9295914Z ################################################################################ 2025-05-07T20:24:51.9296133Z 2025-05-07T20:24:51.9296137Z 2025-05-07T20:24:51.9296253Z ################################################################################ 2025-05-07T20:24:51.9296658Z [CHECK] BUILT LIBRARY: ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9297057Z [CHECK] Listing out library size: 2025-05-07T20:24:51.9297422Z + du -h --block-size=1M ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9297715Z 2025-05-07T20:24:51.9300337Z 1 ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9303016Z 2025-05-07T20:24:51.9303696Z [CHECK] Listing out the GLIBC versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9304502Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so | grep GLIBC_ | sed 's/.*GLIBC_\([.0-9]*\).*/GLIBC_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.9305036Z 2025-05-07T20:24:51.9358530Z GLIBC_2.2.5 2025-05-07T20:24:51.9358738Z GLIBC_2.14 2025-05-07T20:24:51.9361757Z 2025-05-07T20:24:51.9361952Z 2025-05-07T20:24:51.9362361Z [CHECK] Listing out the GLIBCXX versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9363334Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so | grep GLIBCXX_ | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.9363836Z 2025-05-07T20:24:51.9415424Z 2025-05-07T20:24:51.9415436Z 2025-05-07T20:24:51.9448303Z + nm -gDC ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so > /tmp/tmp.bySTbCm3ep.symbols.txt 2025-05-07T20:24:51.9448697Z 2025-05-07T20:24:51.9510846Z 2025-05-07T20:24:51.9552659Z [CHECK] Total Number of symbols: 801 2025-05-07T20:24:51.9574068Z [CHECK] Number of fbgemm symbols: 0 2025-05-07T20:24:51.9610971Z + nm -gDCu ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so > /tmp/tmp.OjuqkNPz9q.usymbols.txt 2025-05-07T20:24:51.9611429Z 2025-05-07T20:24:51.9641236Z 2025-05-07T20:24:51.9682388Z [CHECK] Listing out undefined symbols (49 total): 2025-05-07T20:24:51.9711233Z U abort@GLIBC_2.2.5 2025-05-07T20:24:51.9711509Z U close@GLIBC_2.2.5 2025-05-07T20:24:51.9711912Z U __cxa_guard_acquire@CXXABI_1.3 2025-05-07T20:24:51.9712390Z U __cxa_guard_release@CXXABI_1.3 2025-05-07T20:24:51.9712688Z U __errno_location@GLIBC_2.2.5 2025-05-07T20:24:51.9712965Z U fputs@GLIBC_2.2.5 2025-05-07T20:24:51.9713227Z U free@GLIBC_2.2.5 2025-05-07T20:24:51.9713489Z U ftruncate64@GLIBC_2.2.5 2025-05-07T20:24:51.9713767Z U fwrite@GLIBC_2.2.5 2025-05-07T20:24:51.9714025Z U getenv@GLIBC_2.2.5 2025-05-07T20:24:51.9714381Z U getpagesize@GLIBC_2.2.5 2025-05-07T20:24:51.9714685Z U __gxx_personality_v0@CXXABI_1.3 2025-05-07T20:24:51.9714975Z U madvise@GLIBC_2.2.5 2025-05-07T20:24:51.9715249Z U malloc@GLIBC_2.2.5 2025-05-07T20:24:51.9715507Z U memcmp@GLIBC_2.2.5 2025-05-07T20:24:51.9715772Z U memcpy@GLIBC_2.14 2025-05-07T20:24:51.9716032Z U memmove@GLIBC_2.2.5 2025-05-07T20:24:51.9716299Z U memset@GLIBC_2.2.5 2025-05-07T20:24:51.9716559Z U mmap@GLIBC_2.2.5 2025-05-07T20:24:51.9716818Z U mprotect@GLIBC_2.2.5 2025-05-07T20:24:51.9717085Z U munmap@GLIBC_2.2.5 2025-05-07T20:24:51.9717356Z U open64@GLIBC_2.2.5 2025-05-07T20:24:51.9717675Z U operator delete(void*, unsigned long)@CXXABI_1.3.9 2025-05-07T20:24:51.9718011Z U __popcountdi2@GCC_3.4 2025-05-07T20:24:51.9718314Z U pthread_mutex_destroy@GLIBC_2.2.5 2025-05-07T20:24:51.9718620Z U pthread_mutex_lock@GLIBC_2.2.5 2025-05-07T20:24:51.9718930Z U pthread_mutex_unlock@GLIBC_2.2.5 2025-05-07T20:24:51.9719294Z U read@GLIBC_2.2.5 2025-05-07T20:24:51.9719559Z U realloc@GLIBC_2.2.5 2025-05-07T20:24:51.9719814Z U shm_open 2025-05-07T20:24:51.9720048Z U shm_unlink 2025-05-07T20:24:51.9720300Z U snprintf@GLIBC_2.2.5 2025-05-07T20:24:51.9720571Z U stderr@GLIBC_2.2.5 2025-05-07T20:24:51.9720835Z U strcmp@GLIBC_2.2.5 2025-05-07T20:24:51.9721093Z U strlen@GLIBC_2.2.5 2025-05-07T20:24:51.9721356Z U strtol@GLIBC_2.2.5 2025-05-07T20:24:51.9721612Z U syscall@GLIBC_2.2.5 2025-05-07T20:24:51.9721879Z U sysconf@GLIBC_2.2.5 2025-05-07T20:24:51.9722136Z U uname@GLIBC_2.2.5 2025-05-07T20:24:51.9722407Z U unlink@GLIBC_2.2.5 2025-05-07T20:24:51.9722675Z U _Unwind_Resume@GCC_3.0 2025-05-07T20:24:51.9722950Z U vsnprintf@GLIBC_2.2.5 2025-05-07T20:24:51.9723301Z U vtable for __cxxabiv1::__class_type_info@CXXABI_1.3 2025-05-07T20:24:51.9723695Z U vtable for __cxxabiv1::__si_class_type_info@CXXABI_1.3 2025-05-07T20:24:51.9724105Z U vtable for __cxxabiv1::__vmi_class_type_info@CXXABI_1.3 2025-05-07T20:24:51.9724537Z w __cxa_finalize@GLIBC_2.2.5 2025-05-07T20:24:51.9724817Z w __gmon_start__ 2025-05-07T20:24:51.9725090Z w _ITM_deregisterTMCloneTable 2025-05-07T20:24:51.9725383Z w _ITM_registerTMCloneTable 2025-05-07T20:24:51.9725722Z [CHECK] Listing out external shared libraries linked: 2025-05-07T20:24:51.9726093Z + ldd ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9726336Z 2025-05-07T20:24:51.9788243Z linux-vdso.so.1 (0x00007ffe7a57f000) 2025-05-07T20:24:51.9788554Z libtorch.so => not found 2025-05-07T20:24:51.9788779Z libc10.so => not found 2025-05-07T20:24:51.9789002Z libnvrtc.so.12 => not found 2025-05-07T20:24:51.9789237Z libc10_cuda.so => not found 2025-05-07T20:24:51.9789477Z libnccl.so.2 => not found 2025-05-07T20:24:51.9789776Z libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f1bfd800000) 2025-05-07T20:24:51.9790217Z libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f1bfc600000) 2025-05-07T20:24:51.9790594Z libtorch_cpu.so => not found 2025-05-07T20:24:51.9790972Z libtorch_cuda.so => not found 2025-05-07T20:24:51.9791294Z libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f1bfc26b000) 2025-05-07T20:24:51.9791710Z libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f1bfc053000) 2025-05-07T20:24:51.9792212Z libc.so.6 => /usr/lib64/libc.so.6 (0x00007f1bfbc7d000) 2025-05-07T20:24:51.9792558Z libm.so.6 => /usr/lib64/libm.so.6 (0x00007f1bfb8fb000) 2025-05-07T20:24:51.9792916Z libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f1bfb6f7000) 2025-05-07T20:24:51.9793379Z libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f1bfb4d7000) 2025-05-07T20:24:51.9793792Z librt.so.1 => /usr/lib64/librt.so.1 (0x00007f1bfb2cf000) 2025-05-07T20:24:51.9794139Z /lib64/ld-linux-x86-64.so.2 (0x00007f1c01e0e000) 2025-05-07T20:24:51.9794368Z 2025-05-07T20:24:51.9794501Z [CHECK] Displaying ELF information: 2025-05-07T20:24:51.9794882Z + readelf -d ./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so 2025-05-07T20:24:51.9795168Z 2025-05-07T20:24:51.9810762Z 2025-05-07T20:24:51.9811002Z Dynamic section at offset 0x758e0 contains 36 entries: 2025-05-07T20:24:51.9811358Z Tag Type Name/Value 2025-05-07T20:24:51.9811752Z 0x0000000000000001 (NEEDED) Shared library: [libtorch.so] 2025-05-07T20:24:51.9812205Z 0x0000000000000001 (NEEDED) Shared library: [libc10.so] 2025-05-07T20:24:51.9812684Z 0x0000000000000001 (NEEDED) Shared library: [libnvrtc.so.12] 2025-05-07T20:24:51.9813158Z 0x0000000000000001 (NEEDED) Shared library: [libc10_cuda.so] 2025-05-07T20:24:51.9813626Z 0x0000000000000001 (NEEDED) Shared library: [libnccl.so.2] 2025-05-07T20:24:51.9814167Z 0x0000000000000001 (NEEDED) Shared library: [libcuda.so.1] 2025-05-07T20:24:51.9814651Z 0x0000000000000001 (NEEDED) Shared library: [libnvidia-ml.so.1] 2025-05-07T20:24:51.9815185Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cpu.so] 2025-05-07T20:24:51.9815664Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cuda.so] 2025-05-07T20:24:51.9816146Z 0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6] 2025-05-07T20:24:51.9816608Z 0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1] 2025-05-07T20:24:51.9817069Z 0x0000000000000001 (NEEDED) Shared library: [libc.so.6] 2025-05-07T20:24:51.9817521Z 0x000000000000000e (SONAME) Library soname: [asmjit.so] 2025-05-07T20:24:51.9817898Z 0x000000000000000c (INIT) 0x17000 2025-05-07T20:24:51.9818206Z 0x000000000000000d (FINI) 0x57efc 2025-05-07T20:24:51.9818504Z 0x0000000000000019 (INIT_ARRAY) 0x75b00 2025-05-07T20:24:51.9818819Z 0x000000000000001b (INIT_ARRAYSZ) 8 (bytes) 2025-05-07T20:24:51.9819134Z 0x000000000000001a (FINI_ARRAY) 0x75b08 2025-05-07T20:24:51.9819441Z 0x000000000000001c (FINI_ARRAYSZ) 8 (bytes) 2025-05-07T20:24:51.9819815Z 0x000000006ffffef5 (GNU_HASH) 0x260 2025-05-07T20:24:51.9820115Z 0x0000000000000005 (STRTAB) 0x6388 2025-05-07T20:24:51.9820418Z 0x0000000000000006 (SYMTAB) 0x1858 2025-05-07T20:24:51.9820731Z 0x000000000000000a (STRSZ) 45396 (bytes) 2025-05-07T20:24:51.9821059Z 0x000000000000000b (SYMENT) 24 (bytes) 2025-05-07T20:24:51.9821366Z 0x0000000000000003 (PLTGOT) 0x77000 2025-05-07T20:24:51.9821689Z 0x0000000000000002 (PLTRELSZ) 8208 (bytes) 2025-05-07T20:24:51.9822006Z 0x0000000000000014 (PLTREL) RELA 2025-05-07T20:24:51.9822296Z 0x0000000000000017 (JMPREL) 0x14ef8 2025-05-07T20:24:51.9822597Z 0x0000000000000007 (RELA) 0x11bb0 2025-05-07T20:24:51.9822913Z 0x0000000000000008 (RELASZ) 13128 (bytes) 2025-05-07T20:24:51.9823239Z 0x0000000000000009 (RELAENT) 24 (bytes) 2025-05-07T20:24:51.9823549Z 0x000000006ffffffe (VERNEED) 0x11b20 2025-05-07T20:24:51.9823859Z 0x000000006fffffff (VERNEEDNUM) 3 2025-05-07T20:24:51.9824200Z 0x000000006ffffff0 (VERSYM) 0x114dc 2025-05-07T20:24:51.9824505Z 0x000000006ffffff9 (RELACOUNT) 3 2025-05-07T20:24:51.9824789Z 0x0000000000000000 (NULL) 0x0 2025-05-07T20:24:51.9824983Z 2025-05-07T20:24:51.9825093Z ################################################################################ 2025-05-07T20:24:51.9825305Z 2025-05-07T20:24:51.9825309Z 2025-05-07T20:24:51.9825453Z ################################################################################ 2025-05-07T20:24:51.9825850Z [CHECK] BUILT LIBRARY: ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:51.9826243Z [CHECK] Listing out library size: 2025-05-07T20:24:51.9826615Z + du -h --block-size=1M ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:51.9826900Z 2025-05-07T20:24:51.9840519Z 6 ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:51.9843297Z 2025-05-07T20:24:51.9845109Z [CHECK] Listing out the GLIBC versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:51.9845900Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so | grep GLIBC_ | sed 's/.*GLIBC_\([.0-9]*\).*/GLIBC_\1/g' | sort -Vu | cat 2025-05-07T20:24:51.9846378Z 2025-05-07T20:24:52.0140563Z GLIBC_2.2.5 2025-05-07T20:24:52.0140781Z GLIBC_2.3 2025-05-07T20:24:52.0141394Z GLIBC_2.14 2025-05-07T20:24:52.0141606Z GLIBC_2.27 2025-05-07T20:24:52.0143493Z 2025-05-07T20:24:52.0143612Z 2025-05-07T20:24:52.0144098Z [CHECK] Listing out the GLIBCXX versions referenced by: ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:52.0145122Z + objdump -TC ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so | grep GLIBCXX_ | sed 's/.*GLIBCXX_\([.0-9]*\).*/GLIBCXX_\1/g' | sort -Vu | cat 2025-05-07T20:24:52.0145634Z 2025-05-07T20:24:52.0428719Z GLIBCXX_3.4 2025-05-07T20:24:52.0428943Z GLIBCXX_3.4.9 2025-05-07T20:24:52.0429133Z GLIBCXX_3.4.11 2025-05-07T20:24:52.0429331Z GLIBCXX_3.4.14 2025-05-07T20:24:52.0429517Z GLIBCXX_3.4.15 2025-05-07T20:24:52.0429716Z GLIBCXX_3.4.18 2025-05-07T20:24:52.0429904Z GLIBCXX_3.4.21 2025-05-07T20:24:52.0432288Z 2025-05-07T20:24:52.0432424Z 2025-05-07T20:24:52.0465239Z + nm -gDC ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so > /tmp/tmp.vhB1ID5bpd.symbols.txt 2025-05-07T20:24:52.0465625Z 2025-05-07T20:24:52.1051633Z 2025-05-07T20:24:52.1097869Z [CHECK] Total Number of symbols: 4865 2025-05-07T20:24:52.1126152Z [CHECK] Number of fbgemm symbols: 3361 2025-05-07T20:24:52.1156731Z + nm -gDCu ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so > /tmp/tmp.0SsiJMJbl9.usymbols.txt 2025-05-07T20:24:52.1157510Z 2025-05-07T20:24:52.1198381Z 2025-05-07T20:24:52.1238923Z [CHECK] Listing out undefined symbols (135 total): 2025-05-07T20:24:52.1270430Z U abort@GLIBC_2.2.5 2025-05-07T20:24:52.1270868Z U asmjit::_abi_1_13::BaseAssembler::bind(asmjit::_abi_1_13::Label const&) 2025-05-07T20:24:52.1271461Z U asmjit::_abi_1_13::BaseAssembler::newLabel() 2025-05-07T20:24:52.1272210Z U asmjit::_abi_1_13::BaseEmitter::emitArgsAssignment(asmjit::_abi_1_13::FuncFrame const&, asmjit::_abi_1_13::FuncArgsAssignment const&) 2025-05-07T20:24:52.1273051Z U asmjit::_abi_1_13::BaseEmitter::emitEpilog(asmjit::_abi_1_13::FuncFrame const&) 2025-05-07T20:24:52.1273667Z U asmjit::_abi_1_13::BaseEmitter::_emitI(unsigned int, asmjit::_abi_1_13::Operand_ const&) 2025-05-07T20:24:52.1274422Z U asmjit::_abi_1_13::BaseEmitter::_emitI(unsigned int, asmjit::_abi_1_13::Operand_ const&, asmjit::_abi_1_13::Operand_ const&) 2025-05-07T20:24:52.1275354Z U asmjit::_abi_1_13::BaseEmitter::_emitI(unsigned int, asmjit::_abi_1_13::Operand_ const&, asmjit::_abi_1_13::Operand_ const&, asmjit::_abi_1_13::Operand_ const&) 2025-05-07T20:24:52.1276425Z U asmjit::_abi_1_13::BaseEmitter::_emitI(unsigned int, asmjit::_abi_1_13::Operand_ const&, asmjit::_abi_1_13::Operand_ const&, asmjit::_abi_1_13::Operand_ const&, asmjit::_abi_1_13::Operand_ const&) 2025-05-07T20:24:52.1277401Z U asmjit::_abi_1_13::BaseEmitter::emitProlog(asmjit::_abi_1_13::FuncFrame const&) 2025-05-07T20:24:52.1277850Z U asmjit::_abi_1_13::CodeHolder::~CodeHolder() 2025-05-07T20:24:52.1278336Z U asmjit::_abi_1_13::CodeHolder::CodeHolder(asmjit::_abi_1_13::Support::Temporary const*) 2025-05-07T20:24:52.1279002Z U asmjit::_abi_1_13::CodeHolder::init(asmjit::_abi_1_13::Environment const&, unsigned long) 2025-05-07T20:24:52.1279628Z U asmjit::_abi_1_13::FuncArgsAssignment::updateFuncFrame(asmjit::_abi_1_13::FuncFrame&) const 2025-05-07T20:24:52.1280329Z U asmjit::_abi_1_13::FuncDetail::init(asmjit::_abi_1_13::FuncSignature const&, asmjit::_abi_1_13::Environment const&) 2025-05-07T20:24:52.1280868Z U asmjit::_abi_1_13::FuncFrame::finalize() 2025-05-07T20:24:52.1281283Z U asmjit::_abi_1_13::FuncFrame::init(asmjit::_abi_1_13::FuncDetail const&) 2025-05-07T20:24:52.1281782Z U asmjit::_abi_1_13::JitRuntime::_add(void**, asmjit::_abi_1_13::CodeHolder*) 2025-05-07T20:24:52.1282203Z U asmjit::_abi_1_13::JitRuntime::~JitRuntime() 2025-05-07T20:24:52.1282706Z U asmjit::_abi_1_13::JitRuntime::JitRuntime(asmjit::_abi_1_13::JitAllocator::CreateParams const*) 2025-05-07T20:24:52.1283208Z U asmjit::_abi_1_13::x86::Assembler::~Assembler() 2025-05-07T20:24:52.1283639Z U asmjit::_abi_1_13::x86::Assembler::Assembler(asmjit::_abi_1_13::CodeHolder*) 2025-05-07T20:24:52.1284032Z U cpuinfo_get_packages 2025-05-07T20:24:52.1284322Z U cpuinfo_get_packages_count 2025-05-07T20:24:52.1284691Z U cpuinfo_initialize 2025-05-07T20:24:52.1285218Z U cpuinfo_isa 2025-05-07T20:24:52.1285492Z U __cxa_allocate_exception@CXXABI_1.3 2025-05-07T20:24:52.1285801Z U __cxa_atexit@GLIBC_2.2.5 2025-05-07T20:24:52.1286095Z U __cxa_begin_catch@CXXABI_1.3 2025-05-07T20:24:52.1286384Z U __cxa_end_catch@CXXABI_1.3 2025-05-07T20:24:52.1286681Z U __cxa_free_exception@CXXABI_1.3 2025-05-07T20:24:52.1286984Z U __cxa_guard_abort@CXXABI_1.3 2025-05-07T20:24:52.1287274Z U __cxa_guard_acquire@CXXABI_1.3 2025-05-07T20:24:52.1287573Z U __cxa_guard_release@CXXABI_1.3 2025-05-07T20:24:52.1287897Z U __cxa_init_primary_exception@CXXABI_1.3.11 2025-05-07T20:24:52.1288216Z U __cxa_rethrow@CXXABI_1.3 2025-05-07T20:24:52.1288502Z U __cxa_thread_atexit@CXXABI_1.3.7 2025-05-07T20:24:52.1288837Z U __cxa_throw_bad_array_new_length@CXXABI_1.3.8 2025-05-07T20:24:52.1289150Z U __cxa_throw@CXXABI_1.3 2025-05-07T20:24:52.1289422Z U fmaf@GLIBC_2.2.5 2025-05-07T20:24:52.1289766Z U fma@GLIBC_2.2.5 2025-05-07T20:24:52.1290017Z U fminf@GLIBC_2.2.5 2025-05-07T20:24:52.1290275Z U free@GLIBC_2.2.5 2025-05-07T20:24:52.1290524Z U fwrite@GLIBC_2.2.5 2025-05-07T20:24:52.1290789Z U getenv@GLIBC_2.2.5 2025-05-07T20:24:52.1291062Z U __gxx_personality_v0@CXXABI_1.3 2025-05-07T20:24:52.1291351Z U log2f@GLIBC_2.27 2025-05-07T20:24:52.1291598Z U log2@GLIBC_2.2.5 2025-05-07T20:24:52.1291855Z U lrintf@GLIBC_2.2.5 2025-05-07T20:24:52.1292118Z U memcmp@GLIBC_2.2.5 2025-05-07T20:24:52.1292375Z U memcpy@GLIBC_2.14 2025-05-07T20:24:52.1292633Z U memmove@GLIBC_2.2.5 2025-05-07T20:24:52.1292895Z U memset@GLIBC_2.2.5 2025-05-07T20:24:52.1293166Z U nearbyintf@GLIBC_2.2.5 2025-05-07T20:24:52.1293437Z U nearbyint@GLIBC_2.2.5 2025-05-07T20:24:52.1293727Z U __once_proxy@GLIBCXX_3.4.11 2025-05-07T20:24:52.1294095Z U operator delete[](void*)@GLIBCXX_3.4 2025-05-07T20:24:52.1294449Z U operator delete(void*, unsigned long)@CXXABI_1.3.9 2025-05-07T20:24:52.1294814Z U operator new(unsigned long)@GLIBCXX_3.4 2025-05-07T20:24:52.1295174Z U operator new[](unsigned long)@GLIBCXX_3.4 2025-05-07T20:24:52.1295494Z U posix_memalign@GLIBC_2.2.5 2025-05-07T20:24:52.1295769Z U pow@GLIBC_2.2.5 2025-05-07T20:24:52.1296092Z U pthread_self@GLIBC_2.2.5 2025-05-07T20:24:52.1296370Z U sqrtf@GLIBC_2.2.5 2025-05-07T20:24:52.1296784Z U std::__atomic_futex_unsigned_base::_M_futex_notify_all(unsigned int*)@GLIBCXX_3.4.21 2025-05-07T20:24:52.1297729Z U std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration >, std::chrono::duration >)@GLIBCXX_3.4.21 2025-05-07T20:24:52.1298568Z U std::bad_alloc::~bad_alloc()@GLIBCXX_3.4 2025-05-07T20:24:52.1299305Z U std::basic_ostream >& std::__ostream_insert >(std::basic_ostream >&, char const*, long)@GLIBCXX_3.4.9 2025-05-07T20:24:52.1300011Z U std::cerr@GLIBCXX_3.4 2025-05-07T20:24:52.1300280Z U std::cout@GLIBCXX_3.4 2025-05-07T20:24:52.1300615Z U std::ctype::_M_widen_init() const@GLIBCXX_3.4.11 2025-05-07T20:24:52.1301210Z U std::__detail::_Prime_rehash_policy::_M_need_rehash(unsigned long, unsigned long, unsigned long) const@GLIBCXX_3.4.18 2025-05-07T20:24:52.1301938Z U std::__detail::_Prime_rehash_policy::_M_next_bkt(unsigned long) const@GLIBCXX_3.4.18 2025-05-07T20:24:52.1302369Z U stderr@GLIBC_2.2.5 2025-05-07T20:24:52.1302763Z U std::__exception_ptr::exception_ptr::exception_ptr(void*)@CXXABI_1.3.11 2025-05-07T20:24:52.1303205Z U std::__exception_ptr::exception_ptr::_M_addref() 2025-05-07T20:24:52.1303581Z U std::__exception_ptr::exception_ptr::_M_release() 2025-05-07T20:24:52.1303983Z U std::__future_base::_Result_base::_Result_base()@GLIBCXX_3.4.15 2025-05-07T20:24:52.1304425Z U std::__future_base::_Result_base::~_Result_base()@GLIBCXX_3.4.15 2025-05-07T20:24:52.1304815Z U std::future_category()@GLIBCXX_3.4.15 2025-05-07T20:24:52.1305163Z U std::future_error::~future_error()@GLIBCXX_3.4.14 2025-05-07T20:24:52.1305601Z U std::_Hash_bytes(void const*, unsigned long, unsigned long)@CXXABI_1.3.5 2025-05-07T20:24:52.1306017Z U std::ios_base::Init::~Init()@GLIBCXX_3.4 2025-05-07T20:24:52.1306350Z U std::ios_base::Init::Init()@GLIBCXX_3.4 2025-05-07T20:24:52.1306985Z U std::logic_error::logic_error(std::__cxx11::basic_string, std::allocator > const&)@GLIBCXX_3.4.21 2025-05-07T20:24:52.1307652Z U std::logic_error::logic_error(std::logic_error const&)@GLIBCXX_3.4.21 2025-05-07T20:24:52.1308068Z U std::__once_callable@GLIBCXX_3.4.11 2025-05-07T20:24:52.1308374Z U std::__once_call@GLIBCXX_3.4.11 2025-05-07T20:24:52.1308687Z U std::ostream::flush()@GLIBCXX_3.4 2025-05-07T20:24:52.1309009Z U std::ostream::operator<<(int)@GLIBCXX_3.4 2025-05-07T20:24:52.1309342Z U std::ostream::put(char)@GLIBCXX_3.4 2025-05-07T20:24:52.1309745Z U std::ostream& std::ostream::_M_insert(double)@GLIBCXX_3.4.9 2025-05-07T20:24:52.1310203Z U std::ostream& std::ostream::_M_insert(long)@GLIBCXX_3.4.9 2025-05-07T20:24:52.1310706Z U std::ostream& std::ostream::_M_insert(unsigned long)@GLIBCXX_3.4.9 2025-05-07T20:24:52.1311251Z U std::_Rb_tree_decrement(std::_Rb_tree_node_base*)@GLIBCXX_3.4 2025-05-07T20:24:52.1311675Z U std::_Rb_tree_increment(std::_Rb_tree_node_base*)@GLIBCXX_3.4 2025-05-07T20:24:52.1312416Z U std::_Rb_tree_insert_and_rebalance(bool, std::_Rb_tree_node_base*, std::_Rb_tree_node_base*, std::_Rb_tree_node_base&)@GLIBCXX_3.4 2025-05-07T20:24:52.1313075Z U std::rethrow_exception(std::__exception_ptr::exception_ptr)@CXXABI_1.3.3 2025-05-07T20:24:52.1313607Z U std::runtime_error::runtime_error(char const*)@GLIBCXX_3.4.21 2025-05-07T20:24:52.1314018Z U std::runtime_error::~runtime_error()@GLIBCXX_3.4 2025-05-07T20:24:52.1314355Z U std::terminate()@GLIBCXX_3.4 2025-05-07T20:24:52.1314669Z U std::__throw_bad_alloc()@GLIBCXX_3.4 2025-05-07T20:24:52.1323864Z U std::__throw_bad_array_new_length() 2025-05-07T20:24:52.1324236Z U std::__throw_bad_cast()@GLIBCXX_3.4 2025-05-07T20:24:52.1324589Z U std::__throw_bad_function_call()@GLIBCXX_3.4.14 2025-05-07T20:24:52.1324965Z U std::__throw_future_error(int)@GLIBCXX_3.4.14 2025-05-07T20:24:52.1325365Z U std::__throw_length_error(char const*)@GLIBCXX_3.4 2025-05-07T20:24:52.1325745Z U std::__throw_logic_error(char const*)@GLIBCXX_3.4 2025-05-07T20:24:52.1326106Z U std::__throw_system_error(int)@GLIBCXX_3.4.11 2025-05-07T20:24:52.1326431Z U strcmp@GLIBC_2.2.5 2025-05-07T20:24:52.1326700Z U strlen@GLIBC_2.2.5 2025-05-07T20:24:52.1326968Z U strstr@GLIBC_2.2.5 2025-05-07T20:24:52.1327308Z U __tls_get_addr@GLIBC_2.3 2025-05-07T20:24:52.1327596Z U tolower@GLIBC_2.2.5 2025-05-07T20:24:52.1327864Z U toupper@GLIBC_2.2.5 2025-05-07T20:24:52.1328172Z U typeinfo for std::bad_alloc@GLIBCXX_3.4 2025-05-07T20:24:52.1328570Z U typeinfo for std::__future_base::_Result_base@GLIBCXX_3.4.15 2025-05-07T20:24:52.1328977Z U typeinfo for std::future_error@GLIBCXX_3.4.14 2025-05-07T20:24:52.1329344Z U typeinfo for std::runtime_error@GLIBCXX_3.4 2025-05-07T20:24:52.1329657Z U _Unwind_Resume@GCC_3.0 2025-05-07T20:24:52.1329994Z U vtable for __cxxabiv1::__class_type_info@CXXABI_1.3 2025-05-07T20:24:52.1330384Z U vtable for __cxxabiv1::__si_class_type_info@CXXABI_1.3 2025-05-07T20:24:52.1330756Z U vtable for std::bad_alloc@GLIBCXX_3.4 2025-05-07T20:24:52.1331093Z U vtable for std::future_error@GLIBCXX_3.4.14 2025-05-07T20:24:52.1331414Z w __cxa_finalize@GLIBC_2.2.5 2025-05-07T20:24:52.1331693Z w __gmon_start__ 2025-05-07T20:24:52.1331961Z w _ITM_deregisterTMCloneTable 2025-05-07T20:24:52.1332311Z w _ITM_registerTMCloneTable 2025-05-07T20:24:52.1332592Z w __pthread_key_create 2025-05-07T20:24:52.1332883Z w pthread_mutex_lock@GLIBC_2.2.5 2025-05-07T20:24:52.1333195Z w pthread_mutex_unlock@GLIBC_2.2.5 2025-05-07T20:24:52.1333486Z w pthread_once 2025-05-07T20:24:52.1333746Z w pthread_rwlock_rdlock 2025-05-07T20:24:52.1334021Z w pthread_rwlock_unlock 2025-05-07T20:24:52.1334302Z w pthread_rwlock_wrlock 2025-05-07T20:24:52.1334628Z [CHECK] Listing out external shared libraries linked: 2025-05-07T20:24:52.1335005Z + ldd ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:52.1335237Z 2025-05-07T20:24:52.1353110Z linux-vdso.so.1 (0x00007fffd41eb000) 2025-05-07T20:24:52.1353419Z libc10.so => not found 2025-05-07T20:24:52.1353658Z libnvrtc.so.12 => not found 2025-05-07T20:24:52.1353900Z libc10_cuda.so => not found 2025-05-07T20:24:52.1354149Z libnccl.so.2 => not found 2025-05-07T20:24:52.1354455Z libcuda.so.1 => /usr/lib64/libcuda.so.1 (0x00007f32f9200000) 2025-05-07T20:24:52.1355252Z asmjit.so => /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/./_skbuild/linux-x86_64-3.9/cmake-build/asmjit.so (0x00007f32fe014000) 2025-05-07T20:24:52.1355946Z libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f32f8000000) 2025-05-07T20:24:52.1356314Z libtorch.so => not found 2025-05-07T20:24:52.1356558Z libtorch_cpu.so => not found 2025-05-07T20:24:52.1356807Z libtorch_cuda.so => not found 2025-05-07T20:24:52.1357195Z libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00007f32f7c6b000) 2025-05-07T20:24:52.1357671Z libm.so.6 => /usr/lib64/libm.so.6 (0x00007f32f78e9000) 2025-05-07T20:24:52.1358050Z libgcc_s.so.1 => /usr/lib64/libgcc_s.so.1 (0x00007f32f76d1000) 2025-05-07T20:24:52.1358501Z libc.so.6 => /usr/lib64/libc.so.6 (0x00007f32f72fb000) 2025-05-07T20:24:52.1358853Z /lib64/ld-linux-x86-64.so.2 (0x00007f32fde64000) 2025-05-07T20:24:52.1359211Z libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f32f70f7000) 2025-05-07T20:24:52.1359620Z libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f32f6ed7000) 2025-05-07T20:24:52.1360035Z librt.so.1 => /usr/lib64/librt.so.1 (0x00007f32f6ccf000) 2025-05-07T20:24:52.1360347Z libtorch.so => not found 2025-05-07T20:24:52.1360575Z libc10.so => not found 2025-05-07T20:24:52.1360807Z libnvrtc.so.12 => not found 2025-05-07T20:24:52.1361047Z libc10_cuda.so => not found 2025-05-07T20:24:52.1361290Z libnccl.so.2 => not found 2025-05-07T20:24:52.1361527Z libtorch_cpu.so => not found 2025-05-07T20:24:52.1361784Z libtorch_cuda.so => not found 2025-05-07T20:24:52.1361958Z 2025-05-07T20:24:52.1362067Z [CHECK] Displaying ELF information: 2025-05-07T20:24:52.1362410Z + readelf -d ./_skbuild/linux-x86_64-3.9/cmake-build/fbgemm.so 2025-05-07T20:24:52.1362725Z 2025-05-07T20:24:52.1382790Z 2025-05-07T20:24:52.1383446Z Dynamic section at offset 0x517ac0 contains 40 entries: 2025-05-07T20:24:52.1383830Z Tag Type Name/Value 2025-05-07T20:24:52.1384266Z 0x0000000000000001 (NEEDED) Shared library: [libc10.so] 2025-05-07T20:24:52.1384751Z 0x0000000000000001 (NEEDED) Shared library: [libnvrtc.so.12] 2025-05-07T20:24:52.1385549Z 0x0000000000000001 (NEEDED) Shared library: [libc10_cuda.so] 2025-05-07T20:24:52.1386019Z 0x0000000000000001 (NEEDED) Shared library: [libnccl.so.2] 2025-05-07T20:24:52.1386475Z 0x0000000000000001 (NEEDED) Shared library: [libcuda.so.1] 2025-05-07T20:24:52.1386943Z 0x0000000000000001 (NEEDED) Shared library: [asmjit.so] 2025-05-07T20:24:52.1387416Z 0x0000000000000001 (NEEDED) Shared library: [libnvidia-ml.so.1] 2025-05-07T20:24:52.1387884Z 0x0000000000000001 (NEEDED) Shared library: [libtorch.so] 2025-05-07T20:24:52.1388356Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cpu.so] 2025-05-07T20:24:52.1388824Z 0x0000000000000001 (NEEDED) Shared library: [libtorch_cuda.so] 2025-05-07T20:24:52.1389512Z 0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6] 2025-05-07T20:24:52.1389972Z 0x0000000000000001 (NEEDED) Shared library: [libm.so.6] 2025-05-07T20:24:52.1390426Z 0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1] 2025-05-07T20:24:52.1390883Z 0x0000000000000001 (NEEDED) Shared library: [libc.so.6] 2025-05-07T20:24:52.1391346Z 0x0000000000000001 (NEEDED) Shared library: [ld-linux-x86-64.so.2] 2025-05-07T20:24:52.1391928Z 0x000000000000000e (SONAME) Library soname: [fbgemm.so] 2025-05-07T20:24:52.1392365Z 0x000000000000000f (RPATH) Library rpath: [$ORIGIN] 2025-05-07T20:24:52.1392737Z 0x000000000000000c (INIT) 0xf1000 2025-05-07T20:24:52.1393038Z 0x000000000000000d (FINI) 0x4c0d48 2025-05-07T20:24:52.1393352Z 0x0000000000000019 (INIT_ARRAY) 0x515a40 2025-05-07T20:24:52.1393675Z 0x000000000000001b (INIT_ARRAYSZ) 56 (bytes) 2025-05-07T20:24:52.1393989Z 0x000000000000001a (FINI_ARRAY) 0x515a78 2025-05-07T20:24:52.1394393Z 0x000000000000001c (FINI_ARRAYSZ) 8 (bytes) 2025-05-07T20:24:52.1394702Z 0x000000006ffffef5 (GNU_HASH) 0x298 2025-05-07T20:24:52.1395010Z 0x0000000000000005 (STRTAB) 0x264d0 2025-05-07T20:24:52.1395306Z 0x0000000000000006 (SYMTAB) 0x9ca0 2025-05-07T20:24:52.1395626Z 0x000000000000000a (STRSZ) 707745 (bytes) 2025-05-07T20:24:52.1395948Z 0x000000000000000b (SYMENT) 24 (bytes) 2025-05-07T20:24:52.1396345Z 0x0000000000000003 (PLTGOT) 0x51d000 2025-05-07T20:24:52.1396684Z 0x0000000000000002 (PLTRELSZ) 24240 (bytes) 2025-05-07T20:24:52.1396997Z 0x0000000000000014 (PLTREL) RELA 2025-05-07T20:24:52.1397299Z 0x0000000000000017 (JMPREL) 0xeabd8 2025-05-07T20:24:52.1397590Z 0x0000000000000007 (RELA) 0xd5908 2025-05-07T20:24:52.1397908Z 0x0000000000000008 (RELASZ) 86736 (bytes) 2025-05-07T20:24:52.1398228Z 0x0000000000000009 (RELAENT) 24 (bytes) 2025-05-07T20:24:52.1398548Z 0x000000006ffffffe (VERNEED) 0xd5778 2025-05-07T20:24:52.1398849Z 0x000000006fffffff (VERNEEDNUM) 5 2025-05-07T20:24:52.1399153Z 0x000000006ffffff0 (VERSYM) 0xd3172 2025-05-07T20:24:52.1399457Z 0x000000006ffffff9 (RELACOUNT) 9 2025-05-07T20:24:52.1399734Z 0x0000000000000000 (NULL) 0x0 2025-05-07T20:24:52.1399932Z 2025-05-07T20:24:52.1400044Z ################################################################################ 2025-05-07T20:24:52.1400253Z 2025-05-07T20:24:52.1400258Z 2025-05-07T20:24:52.1400453Z [CHECK] Verifying sample subset of symbols in the built libraries ... 2025-05-07T20:24:52.1660154Z [CHECK] Found symbol in ./_skbuild/linux-x86_64-3.9/cmake-build/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so: fbgemm_gpu::per_tensor_quantize_i8 2025-05-07T20:24:52.1665392Z ################################################################################ 2025-05-07T20:24:52.1665876Z [BUILD] Wheel Audit: dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:52.1666259Z 2025-05-07T20:24:52.1666775Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 auditwheel show dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:52.1667376Z 2025-05-07T20:24:53.1588348Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:24:53.1589418Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:24:54.1126871Z 2025-05-07T20:24:54.1127508Z fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl is 2025-05-07T20:24:54.1128192Z consistent with the following platform tag: "linux_x86_64". 2025-05-07T20:24:54.1128531Z 2025-05-07T20:24:54.1128695Z The wheel references external versioned symbols in these 2025-05-07T20:24:54.1129122Z system-provided shared libraries: libgcc_s.so.1 with versions 2025-05-07T20:24:54.1129751Z {'GCC_3.4', 'GCC_3.0'}, libstdc++.so.6 with versions 2025-05-07T20:24:54.1130137Z {'GLIBCXX_3.4.18', 'CXXABI_1.3.11', 'GLIBCXX_3.4', 'CXXABI_1.3.7', 2025-05-07T20:24:54.1130552Z 'GLIBCXX_3.4.11', 'GLIBCXX_3.4.15', 'CXXABI_1.3.9', 'GLIBCXX_3.4.14', 2025-05-07T20:24:54.1130966Z 'GLIBCXX_3.4.19', 'CXXABI_1.3.3', 'CXXABI_1.3.8', 'CXXABI_1.3', 2025-05-07T20:24:54.1131378Z 'CXXABI_1.3.5', 'GLIBCXX_3.4.21', 'GLIBCXX_3.4.9', 'GLIBCXX_3.4.20'}, 2025-05-07T20:24:54.1131797Z libc.so.6 with versions {'GLIBC_2.17', 'GLIBC_2.7', 'GLIBC_2.4', 2025-05-07T20:24:54.1132215Z 'GLIBC_2.10', 'GLIBC_2.3', 'GLIBC_2.16', 'GLIBC_2.14', 'GLIBC_2.3.4', 2025-05-07T20:24:54.1132616Z 'GLIBC_2.6', 'GLIBC_2.2.5', 'GLIBC_2.3.3', 'GLIBC_2.3.2'}, libm.so.6 2025-05-07T20:24:54.1133044Z with versions {'GLIBC_2.2.5', 'GLIBC_2.27'}, libcudart.so.12 with 2025-05-07T20:24:54.1133504Z versions {'libcudart.so.12'}, libdl.so.2 with versions {'GLIBC_2.2.5', 2025-05-07T20:24:54.1133934Z 'GLIBC_2.3.4'}, librt.so.1 with versions {'GLIBC_2.2.5'}, 2025-05-07T20:24:54.1134332Z libpthread.so.0 with versions {'GLIBC_2.2.5', 'GLIBC_2.12', 2025-05-07T20:24:54.1134748Z 'GLIBC_2.3.2'} 2025-05-07T20:24:54.1134871Z 2025-05-07T20:24:54.1135068Z This constrains the platform tag to "manylinux_2_27_x86_64". In order 2025-05-07T20:24:54.1135545Z to achieve a more compatible tag, you would need to recompile a new 2025-05-07T20:24:54.1136030Z wheel from source on a system with earlier versions of these 2025-05-07T20:24:54.1136405Z libraries, such as a recent manylinux image. 2025-05-07T20:24:54.1698098Z 2025-05-07T20:24:54.1698298Z 2025-05-07T20:24:54.1698902Z ################################################################################ 2025-05-07T20:24:54.1699262Z [BUILD] Enumerating the built wheels ... 2025-05-07T20:24:54.1699716Z + ls -lth dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.1700046Z 2025-05-07T20:24:54.1741796Z -rw-r--r--. 1 root root 18M May 7 20:24 dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.1744251Z 2025-05-07T20:24:54.1744415Z [BUILD] Enumerating the wheel SHAs ... 2025-05-07T20:24:54.1744844Z + sha1sum dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.1745173Z 2025-05-07T20:24:54.2111440Z 603a90ea542be8cba990eaaf5991f74d240b54f7 dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.2114459Z 2025-05-07T20:24:54.2114838Z + sha256sum dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.2115177Z 2025-05-07T20:24:54.2930189Z 27709207cefad2df4ae7ec9c3a55da81178b6955bc77d32b960a1cb5a6eef19d dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.2933217Z 2025-05-07T20:24:54.2933745Z + md5sum dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.2934092Z 2025-05-07T20:24:54.3295275Z 4f0f61dac9f8b96c9bc279e7210d29b9 dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:54.3297940Z 2025-05-07T20:24:54.3298082Z [BUILD] FBGEMM-GPU build + package completed 2025-05-07T20:24:54.3339556Z [NOVA] Time taken to build the package: 2161 seconds / 00:36:01 2025-05-07T20:24:54.3339929Z [NOVA] Copying dist folder to root repo ... 2025-05-07T20:24:54.3340600Z + cp -r /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/dist /__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:24:54.3341001Z 2025-05-07T20:24:54.3498695Z 2025-05-07T20:24:54.3499045Z [NOVA] dist folder has been copied to /__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:24:54.3532714Z total 18440 2025-05-07T20:24:54.3533031Z drwxr-xr-x. 2 root root 81 May 7 20:24 . 2025-05-07T20:24:54.3533371Z drwxr-xr-x. 13 root root 16384 May 7 20:24 .. 2025-05-07T20:24:54.3533888Z -rw-r--r--. 1 root root 18864244 May 7 20:24 fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:24:56.1837392Z ##[group]Run set -euxo pipefail 2025-05-07T20:24:56.1837711Z set -euxo pipefail 2025-05-07T20:24:56.1838096Z source "${BUILD_ENV_FILE}" 2025-05-07T20:24:56.1838620Z export PYTORCH_VERSION="$(${CONDA_RUN} pip show torch | grep ^Version: | sed 's/Version: *//' | sed 's/+.\+//')" 2025-05-07T20:24:56.1839166Z ${CONDA_RUN} python setup.py clean 2025-05-07T20:24:56.1839527Z echo "Successfully ran `python setup.py clean`" 2025-05-07T20:24:56.1839892Z ${CONDA_RUN} python setup.py bdist_wheel 2025-05-07T20:24:56.1840260Z shell: bash -l {0} 2025-05-07T20:24:56.1840458Z env: 2025-05-07T20:24:56.1840650Z PYTHON_VERSION: 3.9 2025-05-07T20:24:56.1840873Z PACKAGE_TYPE: wheel 2025-05-07T20:24:56.1841109Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:24:56.1841349Z REF: 2025-05-07T20:24:56.1841526Z CU_VERSION: cu128 2025-05-07T20:24:56.1841746Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:24:56.1841979Z ARCH: x86_64 2025-05-07T20:24:56.1842209Z BUILD_TARGET: genai 2025-05-07T20:24:56.1842431Z CHANNEL: nightly 2025-05-07T20:24:56.1842689Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:24:56.1843190Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:24:56.1843542Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:24:56.1843940Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:24:56.1844292Z ##[endgroup] 2025-05-07T20:24:56.3976684Z + source /__w/_temp/build_env_14891846312 2025-05-07T20:24:56.3977121Z ++ export BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:24:56.3977425Z ++ BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:24:56.3977718Z ++ export CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:24:56.3978003Z ++ CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:24:56.3978281Z ++ export CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:24:56.3978562Z ++ CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:24:56.3978823Z ++ export FORCE_CUDA=1 2025-05-07T20:24:56.3979064Z ++ FORCE_CUDA=1 2025-05-07T20:24:56.3979864Z ++ export PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:24:56.3981250Z ++ PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:24:56.3982693Z ++ export PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:24:56.3984420Z ++ PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:24:56.3985930Z ++ export 'PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:24:56.3986684Z ++ PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:24:56.3987278Z ++ export PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:24:56.3987697Z ++ PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:24:56.3988047Z ++ export PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:24:56.3988345Z ++ PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:24:56.3988696Z ++ export 'TORCH_CUDA_ARCH_LIST=5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:24:56.3989138Z ++ TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:24:56.3989482Z ++ export VERSION_SUFFIX=+cu128 2025-05-07T20:24:56.3989745Z ++ VERSION_SUFFIX=+cu128 2025-05-07T20:24:56.3989981Z ++ export WHEEL_DIR=cu128/ 2025-05-07T20:24:56.3990219Z ++ WHEEL_DIR=cu128/ 2025-05-07T20:24:56.3990438Z ++ FBGEMM_DIR=/__w/FBGEMM/FBGEMM 2025-05-07T20:24:56.3991117Z ++ export FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:24:56.3991482Z ++ FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:24:56.3991979Z +++ pwd 2025-05-07T20:24:56.3992209Z ++ working_dir=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:24:56.3992699Z ++ [[ /__w/FBGEMM/FBGEMM/pytorch/FBGEMM == \/\_\_\w\/\F\B\G\E\M\M\/\F\B\G\E\M\M\/\p\y\t\o\r\c\h\/\F\B\G\E\M\M ]] 2025-05-07T20:24:56.3993144Z ++ cd fbgemm_gpu 2025-05-07T20:24:56.3993348Z ++ export BUILD_FROM_NOVA=1 2025-05-07T20:24:56.3993584Z ++ BUILD_FROM_NOVA=1 2025-05-07T20:24:56.3993787Z ++ [[ cu128 == \c\u* ]] 2025-05-07T20:24:56.3994141Z ++ echo 'Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:24:56.3994578Z ++ [[ /__w/_temp/conda_environment_14891846312 != '' ]] 2025-05-07T20:24:56.3995054Z ++ export 'CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:24:56.3995667Z ++ CONDA_RUN='conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:24:56.3996224Z ++ echo 'conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:24:56.3996631Z ++ [[ cu128 == \c\u\1\2\8 ]] 2025-05-07T20:24:56.3997010Z ++ export 'TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:24:56.3997449Z ++ TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:24:56.3997973Z ++ echo 'Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:24:56.3998512Z Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0 2025-05-07T20:24:56.3999219Z conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:24:56.3999745Z Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T20:24:56.4000235Z ++ conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 pip show torch 2025-05-07T20:24:56.4000663Z ++ grep '^Version:' 2025-05-07T20:24:56.4000867Z ++ sed 's/Version: *//' 2025-05-07T20:24:56.4001086Z ++ sed 's/+.\+//' 2025-05-07T20:24:57.3111549Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:24:57.3112111Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:24:57.7805609Z + export PYTORCH_VERSION=2.8.0.dev20250507 2025-05-07T20:24:57.7805951Z + PYTORCH_VERSION=2.8.0.dev20250507 2025-05-07T20:24:57.7806437Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python setup.py clean 2025-05-07T20:24:58.6897463Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:24:58.6897947Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:25:00.0675237Z [SETUP.PY] ARGV: ['setup.py', 'clean'] 2025-05-07T20:25:00.0677207Z [SETUP.PY] Parsed setup.py arguments: Namespace(verbose=False, debug=False, dryrun=False, build_target='default', build_variant='cuda', package_channel='nightly', nvml_lib_path=None, nccl_lib_path=None, use_fb_only=False, cxxprefix=None) 2025-05-07T20:25:00.0678370Z [SETUP.PY] Other arguments: ['clean'] 2025-05-07T20:25:00.0678827Z [SETUP.PY] Running under Nova workflow context (clean or build wheel step) ... exiting 2025-05-07T20:25:00.4164520Z ++ python setup.py clean 2025-05-07T20:25:01.3297397Z Traceback (most recent call last): 2025-05-07T20:25:01.3298087Z File "/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu/setup.py", line 21, in 2025-05-07T20:25:01.3298569Z import setuptools_git_versioning as gitversion 2025-05-07T20:25:01.3299005Z ModuleNotFoundError: No module named 'setuptools_git_versioning' 2025-05-07T20:25:01.3485430Z + echo 'Successfully ran ' 2025-05-07T20:25:01.3486005Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python setup.py bdist_wheel 2025-05-07T20:25:01.3486573Z Successfully ran 2025-05-07T20:25:02.2534174Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:25:02.2534630Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:25:03.6254367Z [SETUP.PY] ARGV: ['setup.py', 'bdist_wheel'] 2025-05-07T20:25:03.6257023Z [SETUP.PY] Parsed setup.py arguments: Namespace(verbose=False, debug=False, dryrun=False, build_target='default', build_variant='cuda', package_channel='nightly', nvml_lib_path=None, nccl_lib_path=None, use_fb_only=False, cxxprefix=None) 2025-05-07T20:25:03.6258969Z [SETUP.PY] Other arguments: ['bdist_wheel'] 2025-05-07T20:25:03.6259451Z [SETUP.PY] Running under Nova workflow context (clean or build wheel step) ... exiting 2025-05-07T20:25:03.9823821Z ##[group]Run set -euxo pipefail 2025-05-07T20:25:03.9824146Z set -euxo pipefail 2025-05-07T20:25:03.9824405Z source "${BUILD_ENV_FILE}" 2025-05-07T20:25:03.9824744Z for pkg in pytorch/FBGEMM/dist/*-linux_*.whl; do 2025-05-07T20:25:03.9825100Z  # if the glob didn't match anything 2025-05-07T20:25:03.9825403Z  if [[ ! -e $pkg ]]; then 2025-05-07T20:25:03.9825667Z  continue 2025-05-07T20:25:03.9825882Z  fi 2025-05-07T20:25:03.9826098Z  abs_pkg=$(realpath $pkg) 2025-05-07T20:25:03.9826475Z  ./test-infra/.github/scripts/repair_manylinux_2_28.sh $abs_pkg 2025-05-07T20:25:03.9826860Z done 2025-05-07T20:25:03.9827145Z shell: bash -l {0} 2025-05-07T20:25:03.9827463Z env: 2025-05-07T20:25:03.9827645Z PYTHON_VERSION: 3.9 2025-05-07T20:25:03.9827874Z PACKAGE_TYPE: wheel 2025-05-07T20:25:03.9828112Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:25:03.9828342Z REF: 2025-05-07T20:25:03.9828543Z CU_VERSION: cu128 2025-05-07T20:25:03.9828755Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:25:03.9829048Z ARCH: x86_64 2025-05-07T20:25:03.9829294Z BUILD_TARGET: genai 2025-05-07T20:25:03.9829568Z CHANNEL: nightly 2025-05-07T20:25:03.9829817Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:25:03.9830151Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:25:03.9830495Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:03.9830887Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:03.9831275Z PACKAGE_NAME: fbgemm_gpu 2025-05-07T20:25:03.9831505Z SMOKE_TEST_SCRIPT: 2025-05-07T20:25:03.9831717Z ##[endgroup] 2025-05-07T20:25:04.1864149Z + source /__w/_temp/build_env_14891846312 2025-05-07T20:25:04.1864701Z ++ export BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:25:04.1865118Z ++ BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:25:04.1865434Z ++ export CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:25:04.1865746Z ++ CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:25:04.1866025Z ++ export CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:25:04.1866318Z ++ CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:25:04.1866582Z ++ export FORCE_CUDA=1 2025-05-07T20:25:04.1866878Z ++ FORCE_CUDA=1 2025-05-07T20:25:04.1867752Z ++ export PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.1869208Z ++ PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.1870957Z ++ export PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.1873023Z ++ PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.1874272Z ++ export 'PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:25:04.1875234Z ++ PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:25:04.1875866Z ++ export PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:25:04.1876370Z ++ PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:25:04.1876723Z ++ export PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.1877282Z ++ PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.1877646Z ++ export 'TORCH_CUDA_ARCH_LIST=5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:25:04.1878254Z ++ TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:25:04.1879006Z ++ export VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.1879367Z ++ VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.1879693Z ++ export WHEEL_DIR=cu128/ 2025-05-07T20:25:04.1880012Z ++ WHEEL_DIR=cu128/ 2025-05-07T20:25:04.1880292Z ++ FBGEMM_DIR=/__w/FBGEMM/FBGEMM 2025-05-07T20:25:04.1880706Z ++ export FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:04.1881120Z ++ FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:04.1881498Z +++ pwd 2025-05-07T20:25:04.1881690Z ++ working_dir=/__w/FBGEMM/FBGEMM 2025-05-07T20:25:04.1882073Z ++ [[ /__w/FBGEMM/FBGEMM == \/\_\_\w\/\F\B\G\E\M\M\/\F\B\G\E\M\M\/\p\y\t\o\r\c\h\/\F\B\G\E\M\M ]] 2025-05-07T20:25:04.1882527Z ++ export BUILD_FROM_NOVA=1 2025-05-07T20:25:04.1882836Z ++ BUILD_FROM_NOVA=1 2025-05-07T20:25:04.1883105Z ++ [[ cu128 == \c\u* ]] 2025-05-07T20:25:04.1883615Z ++ echo 'Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:25:04.1884137Z ++ [[ /__w/_temp/conda_environment_14891846312 != '' ]] 2025-05-07T20:25:04.1884628Z ++ export 'CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:25:04.1885756Z ++ CONDA_RUN='conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:25:04.1886360Z ++ echo 'conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:25:04.1886771Z ++ [[ cu128 == \c\u\1\2\8 ]] 2025-05-07T20:25:04.1887072Z ++ export 'TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:25:04.1887562Z ++ TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:25:04.1887966Z ++ echo 'Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:25:04.1888343Z + for pkg in pytorch/FBGEMM/dist/*-linux_*.whl 2025-05-07T20:25:04.1888736Z + [[ ! -e pytorch/FBGEMM/dist/*-linux_*.whl ]] 2025-05-07T20:25:04.1889152Z + continue 2025-05-07T20:25:04.1889511Z Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0 2025-05-07T20:25:04.1890195Z conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.1890803Z Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T20:25:04.1952811Z Prepare all required actions 2025-05-07T20:25:04.1953152Z Getting action download info 2025-05-07T20:25:04.3333984Z ##[group]Run ./test-infra/.github/actions/run-script-with-cache 2025-05-07T20:25:04.3334368Z with: 2025-05-07T20:25:04.3334583Z repository: pytorch/FBGEMM 2025-05-07T20:25:04.3334895Z script: ../.github/scripts/nova_postscript.bash 2025-05-07T20:25:04.3335203Z is_windows: disabled 2025-05-07T20:25:04.3335416Z env: 2025-05-07T20:25:04.3335597Z PYTHON_VERSION: 3.9 2025-05-07T20:25:04.3347296Z PACKAGE_TYPE: wheel 2025-05-07T20:25:04.3347548Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:25:04.3347776Z REF: 2025-05-07T20:25:04.3347953Z CU_VERSION: cu128 2025-05-07T20:25:04.3348159Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:25:04.3348377Z ARCH: x86_64 2025-05-07T20:25:04.3348569Z BUILD_TARGET: genai 2025-05-07T20:25:04.3348773Z CHANNEL: nightly 2025-05-07T20:25:04.3349015Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:25:04.3349375Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:25:04.3349737Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.3350159Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.3350496Z ##[endgroup] 2025-05-07T20:25:04.3370733Z ##[group]Run echo "today=$(/bin/date -u '+%Y%m%d')d" >> "${GITHUB_OUTPUT}" 2025-05-07T20:25:04.3371223Z echo "today=$(/bin/date -u '+%Y%m%d')d" >> "${GITHUB_OUTPUT}" 2025-05-07T20:25:04.3371675Z shell: bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:25:04.3371971Z env: 2025-05-07T20:25:04.3372273Z PYTHON_VERSION: 3.9 2025-05-07T20:25:04.3372505Z PACKAGE_TYPE: wheel 2025-05-07T20:25:04.3372735Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:25:04.3372971Z REF: 2025-05-07T20:25:04.3373145Z CU_VERSION: cu128 2025-05-07T20:25:04.3373363Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:25:04.3373590Z ARCH: x86_64 2025-05-07T20:25:04.3373799Z BUILD_TARGET: genai 2025-05-07T20:25:04.3374019Z CHANNEL: nightly 2025-05-07T20:25:04.3374263Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:25:04.3374602Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:25:04.3374943Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.3375342Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.3375683Z ##[endgroup] 2025-05-07T20:25:04.5091355Z ##[group]Run # Windows scripts needs cleanup on audio and vision, todo remove this once resolved 2025-05-07T20:25:04.5092085Z # Windows scripts needs cleanup on audio and vision, todo remove this once resolved 2025-05-07T20:25:04.5092548Z if [[ disabled == 'disabled' ]]; then 2025-05-07T20:25:04.5092969Z  set -euxo pipefail 2025-05-07T20:25:04.5093211Z fi 2025-05-07T20:25:04.5093415Z source "${BUILD_ENV_FILE}" 2025-05-07T20:25:04.5093673Z  2025-05-07T20:25:04.5093867Z if [[ ! -f ${SCRIPT} ]]; then 2025-05-07T20:25:04.5094307Z  echo "::error::Specified script file (${SCRIPT}) not found, not going execute it" 2025-05-07T20:25:04.5094715Z  exit 1 2025-05-07T20:25:04.5094911Z else 2025-05-07T20:25:04.5095127Z  if [[ ${SCRIPT} == *.bat ]]; then 2025-05-07T20:25:04.5095419Z  ${CONDA_RUN} ${SCRIPT} 2025-05-07T20:25:04.5095673Z  else 2025-05-07T20:25:04.5095891Z  ${CONDA_RUN} bash ${SCRIPT} 2025-05-07T20:25:04.5096153Z  fi 2025-05-07T20:25:04.5096337Z fi 2025-05-07T20:25:04.5096593Z shell: bash -l {0} 2025-05-07T20:25:04.5096791Z env: 2025-05-07T20:25:04.5096979Z PYTHON_VERSION: 3.9 2025-05-07T20:25:04.5097228Z PACKAGE_TYPE: wheel 2025-05-07T20:25:04.5097463Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:25:04.5097693Z REF: 2025-05-07T20:25:04.5097876Z CU_VERSION: cu128 2025-05-07T20:25:04.5098087Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:25:04.5098318Z ARCH: x86_64 2025-05-07T20:25:04.5098512Z BUILD_TARGET: genai 2025-05-07T20:25:04.5098939Z CHANNEL: nightly 2025-05-07T20:25:04.5099187Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:25:04.5099529Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:25:04.5099869Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.5100274Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.5100671Z SCRIPT: ../.github/scripts/nova_postscript.bash 2025-05-07T20:25:04.5100961Z ##[endgroup] 2025-05-07T20:25:04.7165476Z + source /__w/_temp/build_env_14891846312 2025-05-07T20:25:04.7165878Z ++ export BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:25:04.7166189Z ++ BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:25:04.7166478Z ++ export CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:25:04.7166771Z ++ CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:25:04.7167048Z ++ export CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:25:04.7167329Z ++ CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:25:04.7167603Z ++ export FORCE_CUDA=1 2025-05-07T20:25:04.7167821Z ++ FORCE_CUDA=1 2025-05-07T20:25:04.7168608Z ++ export PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.7170013Z ++ PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.7171452Z ++ export PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.7173267Z ++ PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:04.7174416Z ++ export 'PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:25:04.7175164Z ++ PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:25:04.7175746Z ++ export PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:25:04.7176161Z ++ PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:25:04.7176510Z ++ export PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.7176800Z ++ PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.7177159Z ++ export 'TORCH_CUDA_ARCH_LIST=5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:25:04.7177697Z ++ TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:25:04.7178044Z ++ export VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.7178303Z ++ VERSION_SUFFIX=+cu128 2025-05-07T20:25:04.7178536Z ++ export WHEEL_DIR=cu128/ 2025-05-07T20:25:04.7178766Z ++ WHEEL_DIR=cu128/ 2025-05-07T20:25:04.7178987Z ++ FBGEMM_DIR=/__w/FBGEMM/FBGEMM 2025-05-07T20:25:04.7179299Z ++ export FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:04.7179648Z ++ FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:04.7179999Z +++ pwd 2025-05-07T20:25:04.7180235Z ++ working_dir=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:04.7180722Z ++ [[ /__w/FBGEMM/FBGEMM/pytorch/FBGEMM == \/\_\_\w\/\F\B\G\E\M\M\/\F\B\G\E\M\M\/\p\y\t\o\r\c\h\/\F\B\G\E\M\M ]] 2025-05-07T20:25:04.7181166Z ++ cd fbgemm_gpu 2025-05-07T20:25:04.7181369Z ++ export BUILD_FROM_NOVA=1 2025-05-07T20:25:04.7181607Z ++ BUILD_FROM_NOVA=1 2025-05-07T20:25:04.7181813Z ++ [[ cu128 == \c\u* ]] 2025-05-07T20:25:04.7182167Z ++ echo 'Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:25:04.7182600Z ++ [[ /__w/_temp/conda_environment_14891846312 != '' ]] 2025-05-07T20:25:04.7183083Z ++ export 'CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:25:04.7183880Z ++ CONDA_RUN='conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:25:04.7184439Z ++ echo 'conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:25:04.7185158Z ++ [[ cu128 == \c\u\1\2\8 ]] 2025-05-07T20:25:04.7185488Z ++ export 'TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:25:04.7185856Z ++ TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:25:04.7186229Z ++ echo 'Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:25:04.7186615Z + [[ ! -f ../.github/scripts/nova_postscript.bash ]] 2025-05-07T20:25:04.7186978Z + [[ ../.github/scripts/nova_postscript.bash == *.bat ]] 2025-05-07T20:25:04.7187561Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 bash ../.github/scripts/nova_postscript.bash 2025-05-07T20:25:04.7188217Z Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0 2025-05-07T20:25:04.7188719Z conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:04.7189180Z Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T20:25:05.6306412Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:25:05.6306801Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:25:05.6713501Z [NOVA] Current working directory: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:25:05.6719304Z [NOVA] Current working directory: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:06.2650733Z ################################################################################ 2025-05-07T20:25:06.2651371Z Environment Variables: 2025-05-07T20:25:06.2676452Z CONDA_SHLVL=2 2025-05-07T20:25:06.2677055Z LD_LIBRARY_PATH=/usr/local/lib:/usr/local/cuda-12.8/lib64:/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: 2025-05-07T20:25:06.2677661Z CONDA_EXE=/opt/conda/bin/conda 2025-05-07T20:25:06.2677923Z KERN_NAME=Linux 2025-05-07T20:25:06.2678127Z ARCH=x86_64 2025-05-07T20:25:06.2678376Z MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD 2025-05-07T20:25:06.2678693Z LANG=en_US.UTF-8 2025-05-07T20:25:06.2678904Z HISTCONTROL=ignoredups 2025-05-07T20:25:06.2679131Z HOSTNAME=d075a29d5e53 2025-05-07T20:25:06.2679484Z JAVA_LD_LIBRARY_PATH=/__w/_temp/conda_environment_14891846312/lib/jvm/lib/server 2025-05-07T20:25:06.2679888Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:25:06.2680167Z OLDPWD=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:25:06.2680474Z NVCC_VERBOSE=1 2025-05-07T20:25:06.2680715Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:25:06.2681006Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:25:06.2681278Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:25:06.2681639Z CHANNEL=nightly 2025-05-07T20:25:06.2682056Z GITHUB_STEP_SUMMARY=/__w/_temp/_runner_file_commands/step_summary_0a581480-253c-4a0c-a7bc-30e85f77e1c1 2025-05-07T20:25:06.2682540Z CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:25:06.2682974Z GITHUB_ACTION_PATH=/__w/FBGEMM/FBGEMM/./test-infra/.github/actions/run-script-with-cache 2025-05-07T20:25:06.2683412Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:25:06.2683652Z GSETTINGS_SCHEMA_DIR_CONDA_BACKUP= 2025-05-07T20:25:06.2683917Z MACHINE_NAME_LC=x86_64 2025-05-07T20:25:06.2684140Z RUNNER_TOOL_CACHE=/__w/_tool 2025-05-07T20:25:06.2684541Z CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:06.2685347Z CONDA_PREFIX=/__w/_temp/conda_environment_14891846312 2025-05-07T20:25:06.2685732Z JAVA_HOME=/__w/_temp/conda_environment_14891846312/lib/jvm 2025-05-07T20:25:06.2686065Z BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:25:06.2686382Z CONDA_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T20:25:06.2686708Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:25:06.2686962Z MACHINE_NAME=x86_64 2025-05-07T20:25:06.2687206Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:25:06.2687457Z GITHUB_ACTIONS=true 2025-05-07T20:25:06.2687665Z KERN_NAME_LC=linux 2025-05-07T20:25:06.2688461Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/build_wheels_genai_linux_x86.yml@refs/pull/4066/merge 2025-05-07T20:25:06.2688984Z _CE_M= 2025-05-07T20:25:06.2689170Z which_declare=declare -f 2025-05-07T20:25:06.2689395Z CI=true 2025-05-07T20:25:06.2689596Z CUDNN_LIBRARY=/usr/local/cuda-12.8/lib64 2025-05-07T20:25:06.2689937Z MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl 2025-05-07T20:25:06.2690238Z USER=root 2025-05-07T20:25:06.2690443Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:25:06.2690719Z CONDA_PREFIX_1=/opt/conda 2025-05-07T20:25:06.2691025Z CU_VERSION=cu128 2025-05-07T20:25:06.2691221Z GITHUB_ACTOR=q10 2025-05-07T20:25:06.2691423Z GITHUB_ACTION_REF= 2025-05-07T20:25:06.2691630Z GITHUB_ACTION=__self_4 2025-05-07T20:25:06.2691865Z GITHUB_REF_PROTECTED=false 2025-05-07T20:25:06.2692091Z WHEEL_DIR=cu128/ 2025-05-07T20:25:06.2692458Z *** 2025-05-07T20:25:06.2692633Z VERSION_SUFFIX=+cu128 2025-05-07T20:25:06.2692853Z HOME=/github/home 2025-05-07T20:25:06.2693088Z CONDA_PYTHON_EXE=/opt/conda/bin/python 2025-05-07T20:25:06.2693543Z GITHUB_STATE=/__w/_temp/_runner_file_commands/save_state_0a581480-253c-4a0c-a7bc-30e85f77e1c1 2025-05-07T20:25:06.2694025Z ARTIFACT_NAME=pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:25:06.2694329Z CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:25:06.2694590Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:25:06.2694823Z GITHUB_REF_TYPE=branch 2025-05-07T20:25:06.2695047Z RUNNER_TEMP=/__w/_temp 2025-05-07T20:25:06.2695263Z BUILD_FROM_NOVA=1 2025-05-07T20:25:06.2695479Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:25:06.2695690Z REF= 2025-05-07T20:25:06.2695866Z GITHUB_ENV=TRUE 2025-05-07T20:25:06.2696095Z SSL_CERT_FILE=/opt/_internal/certs.pem 2025-05-07T20:25:06.2696443Z RUNNER_WORKSPACE=/__w/FBGEMM 2025-05-07T20:25:06.2696698Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:25:06.2696985Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:25:06.2697282Z _CE_CONDA= 2025-05-07T20:25:06.2697615Z GSETTINGS_SCHEMA_DIR=/__w/_temp/conda_environment_14891846312/share/glib-2.0/schemas 2025-05-07T20:25:06.2698273Z __CONDA_SHLVL_1_LD_LIBRARY_PATH=/opt/rh/gcc-toolset-11/root/usr/lib64:/opt/rh/gcc-toolset-11/root/usr/lib: 2025-05-07T20:25:06.2698771Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:25:06.2699023Z GITHUB_RUN_ID=14891846312 2025-05-07T20:25:06.2699292Z FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:25:06.2699636Z BUILD_ENV_FILE=/__w/_temp/build_env_14891846312 2025-05-07T20:25:06.2699956Z RUNNER_ARCH=X64 2025-05-07T20:25:06.2700179Z GITHUB_SERVER_URL=https://github.com 2025-05-07T20:25:06.2700690Z PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128 2025-05-07T20:25:06.2701193Z REPOSITORY=pytorch/FBGEMM 2025-05-07T20:25:06.2701434Z GITHUB_ACTOR_ID=255046 2025-05-07T20:25:06.2702016Z NVCC_PREPEND_FLAGS=-std=c++20 -Xcompiler -std=c++20 -ccbin /opt/rh/gcc-toolset-11/root/usr/bin/c++ -allow-unsupported-compiler 2025-05-07T20:25:06.2702569Z LOADEDMODULES= 2025-05-07T20:25:06.2702779Z UPLOAD_TO_BASE_BUCKET=no 2025-05-07T20:25:06.2703043Z GITHUB_EVENT_PATH=/github/workflow/event.json 2025-05-07T20:25:06.2703442Z CONDA_PROMPT_MODIFIER=(/__w/_temp/conda_environment_14891846312) 2025-05-07T20:25:06.2703807Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:25:06.2704046Z PACKAGE_TYPE=wheel 2025-05-07T20:25:06.2704308Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:25:06.2704629Z MAIL=/var/spool/mail/root 2025-05-07T20:25:06.2706302Z RUNNER_OS=Linux 2025-05-07T20:25:06.2706503Z GITHUB_BASE_REF=main 2025-05-07T20:25:06.2706706Z FORCE_CUDA=1 2025-05-07T20:25:06.2706937Z TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T20:25:06.2707398Z GITHUB_PATH=/__w/_temp/_runner_file_commands/add_path_0a581480-253c-4a0c-a7bc-30e85f77e1c1 2025-05-07T20:25:06.2720534Z GITHUB_JOB=build 2025-05-07T20:25:06.2720796Z BUILD_TARGET=genai 2025-05-07T20:25:06.2721047Z CUDNN_INCLUDE_DIR=/usr/local/cuda-12.8/include 2025-05-07T20:25:06.2721375Z RUNNER_NAME=i-0ca7df191fe9703a4 2025-05-07T20:25:06.2721625Z PYTHON_VERSION=3.9 2025-05-07T20:25:06.2721845Z CONDA_ROOT=/opt/conda 2025-05-07T20:25:06.2722373Z GITHUB_OUTPUT=/__w/_temp/_runner_file_commands/set_output_0a581480-253c-4a0c-a7bc-30e85f77e1c1 2025-05-07T20:25:06.2722915Z PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:25:06.2723172Z SHLVL=3 2025-05-07T20:25:06.2723371Z LANGUAGE=en_US.UTF-8 2025-05-07T20:25:06.2723630Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:25:06.2723900Z MANPATH=: 2025-05-07T20:25:06.2724153Z SCRIPT=../.github/scripts/nova_postscript.bash 2025-05-07T20:25:06.2724499Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:25:06.2725082Z MODULEPATH=/etc/scl/modulefiles:/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles 2025-05-07T20:25:06.2725678Z LOGNAME=root 2025-05-07T20:25:06.2726169Z MODULEPATH_modshare=/usr/share/Modules/modulefiles:2:/etc/modulefiles:2:/usr/share/modulefiles:2 2025-05-07T20:25:06.2726738Z GITHUB_RUN_NUMBER=1266 2025-05-07T20:25:06.2727038Z GITHUB_WORKFLOW=Build FBGEMM GenAI x86 Linux Wheels 2025-05-07T20:25:06.2728476Z PATH=/__w/_temp/conda_environment_14891846312/bin:/opt/conda/condabin:/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:25:06.2729935Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:25:06.2730395Z DEBUGINFOD_URLS=https://debuginfod.centos.org/ 2025-05-07T20:25:06.2730757Z GITHUB_WORKSPACE=/__w/FBGEMM/FBGEMM 2025-05-07T20:25:06.2731062Z MODULESHOME=/usr/share/Modules 2025-05-07T20:25:06.2731408Z CONDA_DEFAULT_ENV=/__w/_temp/conda_environment_14891846312 2025-05-07T20:25:06.2731842Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:25:06.2732098Z HISTSIZE=1000 2025-05-07T20:25:06.2732380Z PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:25:06.2732764Z LESSOPEN=||/usr/bin/lesspipe.sh %s 2025-05-07T20:25:06.2733056Z BASH_FUNC_which%%=() { ( alias; 2025-05-07T20:25:06.2733609Z eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@ 2025-05-07T20:25:06.2734171Z } 2025-05-07T20:25:06.2734384Z BASH_FUNC_module%%=() { unset _mlshdbg; 2025-05-07T20:25:06.2734733Z if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then 2025-05-07T20:25:06.2735061Z case "$-" in 2025-05-07T20:25:06.2735258Z *v*x*) 2025-05-07T20:25:06.2735444Z set +vx; 2025-05-07T20:25:06.2735625Z _mlshdbg='vx' 2025-05-07T20:25:06.2735821Z ;; 2025-05-07T20:25:06.2735991Z *v*) 2025-05-07T20:25:06.2736169Z set +v; 2025-05-07T20:25:06.2736344Z _mlshdbg='v' 2025-05-07T20:25:06.2736538Z ;; 2025-05-07T20:25:06.2736703Z *x*) 2025-05-07T20:25:06.2736879Z set +x; 2025-05-07T20:25:06.2737057Z _mlshdbg='x' 2025-05-07T20:25:06.2737250Z ;; 2025-05-07T20:25:06.2737466Z *) 2025-05-07T20:25:06.2737642Z _mlshdbg='' 2025-05-07T20:25:06.2737826Z ;; 2025-05-07T20:25:06.2738001Z esac; 2025-05-07T20:25:06.2738177Z fi; 2025-05-07T20:25:06.2738357Z unset _mlre _mlIFS; 2025-05-07T20:25:06.2738599Z if [ -n "${IFS+x}" ]; then 2025-05-07T20:25:06.2738838Z _mlIFS=$IFS; 2025-05-07T20:25:06.2739036Z fi; 2025-05-07T20:25:06.2739205Z IFS=' '; 2025-05-07T20:25:06.2739427Z for _mlv in ${MODULES_RUN_QUARANTINE:-}; 2025-05-07T20:25:06.2739722Z do 2025-05-07T20:25:06.2740003Z if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then 2025-05-07T20:25:06.2740451Z if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then 2025-05-07T20:25:06.2740834Z _mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' "; 2025-05-07T20:25:06.2741187Z fi; 2025-05-07T20:25:06.2741382Z _mlrv="MODULES_RUNENV_${_mlv}"; 2025-05-07T20:25:06.2741722Z _mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' "; 2025-05-07T20:25:06.2742058Z fi; 2025-05-07T20:25:06.2742236Z done; 2025-05-07T20:25:06.2742427Z if [ -n "${_mlre:-}" ]; then 2025-05-07T20:25:06.2742860Z eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`; 2025-05-07T20:25:06.2743276Z else 2025-05-07T20:25:06.2743587Z eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`; 2025-05-07T20:25:06.2744044Z fi; 2025-05-07T20:25:06.2744215Z _mlstatus=$?; 2025-05-07T20:25:06.2744419Z if [ -n "${_mlIFS+x}" ]; then 2025-05-07T20:25:06.2744645Z IFS=$_mlIFS; 2025-05-07T20:25:06.2744822Z else 2025-05-07T20:25:06.2744985Z unset IFS; 2025-05-07T20:25:06.2745156Z fi; 2025-05-07T20:25:06.2745335Z unset _mlre _mlv _mlrv _mlIFS; 2025-05-07T20:25:06.2745592Z if [ -n "${_mlshdbg:-}" ]; then 2025-05-07T20:25:06.2745826Z set -$_mlshdbg; 2025-05-07T20:25:06.2746016Z fi; 2025-05-07T20:25:06.2746179Z unset _mlshdbg; 2025-05-07T20:25:06.2746375Z return $_mlstatus 2025-05-07T20:25:06.2746565Z } 2025-05-07T20:25:06.2746778Z BASH_FUNC_switchml%%=() { typeset swfound=1; 2025-05-07T20:25:06.2747116Z if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then 2025-05-07T20:25:06.2747425Z typeset swname='main'; 2025-05-07T20:25:06.2747719Z if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then 2025-05-07T20:25:06.2748030Z typeset swfound=0; 2025-05-07T20:25:06.2748256Z unset MODULES_USE_COMPAT_VERSION; 2025-05-07T20:25:06.2748499Z fi; 2025-05-07T20:25:06.2748662Z else 2025-05-07T20:25:06.2748854Z typeset swname='compatibility'; 2025-05-07T20:25:06.2749185Z if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then 2025-05-07T20:25:06.2749510Z typeset swfound=0; 2025-05-07T20:25:06.2749734Z MODULES_USE_COMPAT_VERSION=1; 2025-05-07T20:25:06.2749996Z export MODULES_USE_COMPAT_VERSION; 2025-05-07T20:25:06.2750249Z fi; 2025-05-07T20:25:06.2750412Z fi; 2025-05-07T20:25:06.2750589Z if [ $swfound -eq 0 ]; then 2025-05-07T20:25:06.2750861Z echo "Switching to Modules $swname version"; 2025-05-07T20:25:06.2751213Z source /usr/share/Modules/init/bash; 2025-05-07T20:25:06.2751470Z else 2025-05-07T20:25:06.2751751Z echo "Cannot switch to Modules $swname version, command not found"; 2025-05-07T20:25:06.2752205Z return 1; 2025-05-07T20:25:06.2752374Z fi 2025-05-07T20:25:06.2752535Z } 2025-05-07T20:25:06.2752782Z BASH_FUNC_scl%%=() { if [ "$1" = "load" -o "$1" = "unload" ]; then 2025-05-07T20:25:06.2753120Z eval "module $@"; 2025-05-07T20:25:06.2753307Z else 2025-05-07T20:25:06.2753480Z /usr/bin/scl "$@"; 2025-05-07T20:25:06.2753672Z fi 2025-05-07T20:25:06.2753827Z } 2025-05-07T20:25:06.2754006Z BASH_FUNC_ml%%=() { module ml "$@" 2025-05-07T20:25:06.2754241Z } 2025-05-07T20:25:06.2754410Z _=/usr/bin/printenv 2025-05-07T20:25:06.2754640Z ################################################################################ 2025-05-07T20:25:06.2754946Z ################################################################################ 2025-05-07T20:25:06.2755328Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:25:06.2755671Z # 2025-05-07T20:25:06.2756069Z # [2025-05-07T20:25:06.271Z] + collect_pytorch_env_info /__w/_temp/conda_environment_14891846312 2025-05-07T20:25:06.2756515Z ################################################################################ 2025-05-07T20:25:06.2756718Z 2025-05-07T20:25:06.2756897Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:25:06.4113898Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:25:06.4129386Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:25:06.4130002Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:25:06.4130391Z 2025-05-07T20:25:06.5333378Z 2025-05-07T20:25:06.5333789Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:25:06.5377539Z [EXEC] [ATTEMPT 0/3] + conda run -p /__w/_temp/conda_environment_14891846312 python collect_env.py 2025-05-07T20:25:11.2961430Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:11.2963569Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:11.2963795Z 2025-05-07T20:25:11.2963913Z Collecting environment information... 2025-05-07T20:25:11.2964208Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:25:11.2964492Z Is debug build: False 2025-05-07T20:25:11.2964727Z CUDA used to build PyTorch: 12.8 2025-05-07T20:25:11.2965001Z ROCM used to build PyTorch: N/A 2025-05-07T20:25:11.2965171Z 2025-05-07T20:25:11.2965288Z OS: AlmaLinux 8.10 (Cerulean Leopard) (x86_64) 2025-05-07T20:25:11.2965630Z GCC version: (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9) 2025-05-07T20:25:11.2965954Z Clang version: Could not collect 2025-05-07T20:25:11.2966228Z CMake version: version 4.0.0 2025-05-07T20:25:11.2966478Z Libc version: glibc-2.28 2025-05-07T20:25:11.2966627Z 2025-05-07T20:25:11.2966923Z Python version: 3.9.22 | packaged by conda-forge | (main, Apr 14 2025, 23:35:59) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:25:11.2967531Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.28 2025-05-07T20:25:11.2967921Z Is CUDA available: False 2025-05-07T20:25:11.2968165Z CUDA runtime version: 12.8.61 2025-05-07T20:25:11.2968415Z CUDA_MODULE_LOADING set to: N/A 2025-05-07T20:25:11.2968708Z GPU models and configuration: Could not collect 2025-05-07T20:25:11.2969037Z Nvidia driver version: Could not collect 2025-05-07T20:25:11.2969323Z cuDNN version: Could not collect 2025-05-07T20:25:11.2969583Z HIP runtime version: N/A 2025-05-07T20:25:11.2969820Z MIOpen runtime version: N/A 2025-05-07T20:25:11.2970068Z Is XNNPACK available: True 2025-05-07T20:25:11.2970349Z 2025-05-07T20:25:11.2970416Z CPU: 2025-05-07T20:25:11.2970601Z Architecture: x86_64 2025-05-07T20:25:11.2970849Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:25:11.2971118Z Byte Order: Little Endian 2025-05-07T20:25:11.2971363Z CPU(s): 16 2025-05-07T20:25:11.2971586Z On-line CPU(s) list: 0-15 2025-05-07T20:25:11.2971817Z Thread(s) per core: 2 2025-05-07T20:25:11.2972058Z Core(s) per socket: 8 2025-05-07T20:25:11.2972302Z Socket(s): 1 2025-05-07T20:25:11.2972505Z NUMA node(s): 1 2025-05-07T20:25:11.2972730Z Vendor ID: AuthenticAMD 2025-05-07T20:25:11.2972974Z CPU family: 23 2025-05-07T20:25:11.2973191Z Model: 49 2025-05-07T20:25:11.2973411Z Model name: AMD EPYC 7R32 2025-05-07T20:25:11.2973665Z Stepping: 0 2025-05-07T20:25:11.2973878Z CPU MHz: 2799.998 2025-05-07T20:25:11.2974121Z BogoMIPS: 5599.99 2025-05-07T20:25:11.2974356Z Hypervisor vendor: KVM 2025-05-07T20:25:11.2974601Z Virtualization type: full 2025-05-07T20:25:11.2974834Z L1d cache: 32K 2025-05-07T20:25:11.2975150Z L1i cache: 32K 2025-05-07T20:25:11.2975374Z L2 cache: 512K 2025-05-07T20:25:11.2975594Z L3 cache: 16384K 2025-05-07T20:25:11.2975827Z NUMA node0 CPU(s): 0-15 2025-05-07T20:25:11.2977837Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:25:11.2979782Z 2025-05-07T20:25:11.2979885Z Versions of relevant libraries: 2025-05-07T20:25:11.2980146Z [pip3] mypy_extensions==1.1.0 2025-05-07T20:25:11.2980394Z [pip3] numpy==2.0.2 2025-05-07T20:25:11.2980616Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:25:11.2980912Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:25:11.2981207Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:25:11.2981511Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:25:11.2981901Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:25:11.2982180Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:25:11.2982453Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:25:11.2982738Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:25:11.2983030Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:25:11.2983318Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:25:11.2983601Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:25:11.2983884Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:25:11.2984173Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:25:11.2984442Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:25:11.2984739Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:25:11.2985471Z [conda] numpy 2.0.2 pypi_0 pypi 2025-05-07T20:25:11.2985926Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:25:11.2986406Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:25:11.2986891Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:25:11.2987383Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:25:11.2987856Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:25:11.2988313Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:25:11.2988777Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:25:11.2989248Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:25:11.2989819Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:25:11.2990297Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:25:11.2990768Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:25:11.2991231Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:25:11.2991708Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:25:11.2992344Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:25:11.2992791Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:25:11.2993056Z 2025-05-07T20:25:11.3433666Z [NOVA] Time taken to collect PyTorch environment information: 5 seconds 2025-05-07T20:25:11.3434196Z ################################################################################ 2025-05-07T20:25:11.3434501Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:25:11.3434747Z # 2025-05-07T20:25:11.3471924Z # [2025-05-07T20:25:11.346Z] + install_fbgemm_gpu_wheel /__w/_temp/conda_environment_14891846312 fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.3473137Z ################################################################################ 2025-05-07T20:25:11.3473385Z 2025-05-07T20:25:11.3473817Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.3474701Z + sha1sum fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.3475088Z 2025-05-07T20:25:11.3822011Z 603a90ea542be8cba990eaaf5991f74d240b54f7 fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.3823077Z 2025-05-07T20:25:11.3823418Z + sha256sum fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.4635811Z 2025-05-07T20:25:11.4636435Z 27709207cefad2df4ae7ec9c3a55da81178b6955bc77d32b960a1cb5a6eef19d fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.4638585Z 2025-05-07T20:25:11.4639281Z + md5sum fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.4639680Z 2025-05-07T20:25:11.4996397Z 4f0f61dac9f8b96c9bc279e7210d29b9 fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:11.4998918Z 2025-05-07T20:25:11.5007388Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl ... 2025-05-07T20:25:11.5045463Z [EXEC] [ATTEMPT 0/3] + conda run -p /__w/_temp/conda_environment_14891846312 python -m pip install fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:13.3755164Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2025-05-07T20:25:13.3756638Z 2025-05-07T20:25:13.3756977Z Processing ./fbgemm_gpu/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:25:13.3758165Z Requirement already satisfied: numpy in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from fbgemm-gpu-genai==2025.5.7+cu128) (2.0.2) 2025-05-07T20:25:13.3759019Z Installing collected packages: fbgemm-gpu-genai 2025-05-07T20:25:13.3759450Z Successfully installed fbgemm-gpu-genai-2025.5.7+cu128 2025-05-07T20:25:13.3759836Z 2025-05-07T20:25:18.2705470Z ################################################################################ 2025-05-07T20:25:18.2705942Z [CHECK] !!!! INFO !!!! 2025-05-07T20:25:18.2706547Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128 2025-05-07T20:25:18.2707962Z [CHECK] CUDA version reported by PyTorch is: 12.8 2025-05-07T20:25:18.2708319Z [CHECK] 2025-05-07T20:25:18.2708793Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU 2025-05-07T20:25:18.2709410Z [CHECK] package channel; the package may be broken at runtime!!! 2025-05-07T20:25:18.2709867Z ################################################################################ 2025-05-07T20:25:18.2710130Z 2025-05-07T20:25:18.2710349Z [INSTALL] Checking imports and symbols ... 2025-05-07T20:25:21.2700178Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:21.2701795Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:21.2702238Z 2025-05-07T20:25:21.3122966Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:25:24.0017632Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:24.0019218Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:24.0019457Z 2025-05-07T20:25:24.0445885Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'. 2025-05-07T20:25:26.7424069Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:26.7425639Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:26.7425927Z 2025-05-07T20:25:26.7847127Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'. 2025-05-07T20:25:26.7851916Z [CHECK] Printing out the FBGEMM-GPU version ... 2025-05-07T20:25:29.4768743Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:29.4770571Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:29.4770820Z 2025-05-07T20:25:32.2128269Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:32.2129967Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:32.2130219Z 2025-05-07T20:25:34.9593470Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:34.9595292Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:34.9595537Z 2025-05-07T20:25:35.0019602Z ################################################################################ 2025-05-07T20:25:35.0020297Z [CHECK] The installed FBGEMM TARGET is: genai 2025-05-07T20:25:35.0020839Z [CHECK] The installed FBGEMM VARIANT is: cuda 2025-05-07T20:25:35.0021248Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7+cu128 2025-05-07T20:25:35.0021698Z ################################################################################ 2025-05-07T20:25:35.0021991Z 2025-05-07T20:25:37.6942314Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:37.6943906Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:37.6944176Z 2025-05-07T20:25:40.4372706Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:40.4374659Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:40.4374912Z 2025-05-07T20:25:40.4802711Z ################################################################################ 2025-05-07T20:25:40.4803181Z [CHECK] FBGEMM_GPU Experimental Packages 2025-05-07T20:25:40.4806214Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils'] 2025-05-07T20:25:40.4809646Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 2025-05-07T20:25:40.4810847Z ################################################################################ 2025-05-07T20:25:40.4811299Z 2025-05-07T20:25:40.4811762Z [INSTALL] Check for installation of Python sources ... 2025-05-07T20:25:43.1807107Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:43.1808668Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:43.1809057Z 2025-05-07T20:25:43.2235261Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ... 2025-05-07T20:25:45.9217387Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:45.9218929Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:45.9219289Z 2025-05-07T20:25:45.9640926Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ... 2025-05-07T20:25:49.2425997Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:49.2427888Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:49.2428141Z 2025-05-07T20:25:49.2846859Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ... 2025-05-07T20:25:51.9738668Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:51.9740632Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:51.9740886Z 2025-05-07T20:25:52.0170368Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ... 2025-05-07T20:25:52.0173504Z [INSTALL] Check for operator registrations ... 2025-05-07T20:25:54.7050173Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:54.7052040Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:54.7052303Z 2025-05-07T20:25:54.7052497Z fbgemm.nccl_init 2025-05-07T20:25:54.7052645Z 2025-05-07T20:25:54.7475209Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:25:57.4302821Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:25:57.4304357Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:25:57.4304635Z 2025-05-07T20:25:57.4304881Z fbgemm.gqa_attn_splitk 2025-05-07T20:25:57.4305101Z 2025-05-07T20:25:57.4726498Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:26:00.1650614Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:26:00.1652380Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:26:00.1652620Z 2025-05-07T20:26:00.1652811Z fbgemm.rope_qkv_decoding 2025-05-07T20:26:00.1652998Z 2025-05-07T20:26:00.2072433Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:26:00.2073242Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:26:00.2108572Z [NOVA] Time taken to install wheel: 49 seconds 2025-05-07T20:26:02.3090979Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:26:02.3092636Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:26:02.3093455Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:799: UserWarning: Can't initialize NVML 2025-05-07T20:26:02.3094229Z warnings.warn("Can't initialize NVML") 2025-05-07T20:26:02.3138398Z cuda.is_available() False 2025-05-07T20:26:02.3139073Z device_count() 0 2025-05-07T20:26:02.4458888Z 2025-05-07T20:26:02.4459385Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-05-07T20:26:02.4459757Z WARNING: 2025-05-07T20:26:02.4460118Z 2025-05-07T20:26:02.4460365Z You should always run with libnvidia-ml.so that is installed with your 2025-05-07T20:26:02.4461038Z NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. 2025-05-07T20:26:02.4462077Z libnvidia-ml.so in GDK package is a stub library that is attached only for 2025-05-07T20:26:02.4462787Z build purposes (e.g. machine that you build your application doesn't have 2025-05-07T20:26:02.4463264Z to have Display Driver installed). 2025-05-07T20:26:02.4463706Z !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 2025-05-07T20:26:02.6365484Z ################################################################################ 2025-05-07T20:26:02.6365915Z # Test All FBGEMM-GPU Modules 2025-05-07T20:26:02.6366410Z # 2025-05-07T20:26:02.6401789Z # [2025-05-07T20:26:02.639Z] + test_all_fbgemm_gpu_modules /__w/_temp/conda_environment_14891846312 2025-05-07T20:26:02.6402657Z ################################################################################ 2025-05-07T20:26:02.6402897Z 2025-05-07T20:26:05.3313975Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:26:05.3315634Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:26:05.3315862Z 2025-05-07T20:26:08.0755528Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:26:08.0757184Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:26:08.0757424Z 2025-05-07T20:26:08.1180573Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:26:08.1181569Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:26:08.1182380Z [TEST] Determined the test directories: 2025-05-07T20:26:08.1182775Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:26:08.1183217Z fbgemm_gpu/experimental/example/test 2025-05-07T20:26:08.1183564Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:26:08.1183832Z 2025-05-07T20:26:08.1190844Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:26:08.1201096Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:26:08.1201834Z + conda env config vars unset -p /__w/_temp/conda_environment_14891846312 CUDA_VISIBLE_DEVICES 2025-05-07T20:26:08.1202314Z 2025-05-07T20:26:08.5947531Z To make your changes take effect please reactivate your environment 2025-05-07T20:26:08.6368109Z 2025-05-07T20:26:08.6368998Z [TEST] Installing PyTest ... 2025-05-07T20:26:08.6409680Z [EXEC] [ATTEMPT 0/3] + conda install -p /__w/_temp/conda_environment_14891846312 -c conda-forge --override-channels -y pytest expecttest 2025-05-07T20:26:32.1850685Z Collecting package metadata (current_repodata.json): ...working... done 2025-05-07T20:26:33.7754440Z Solving environment: ...working... done 2025-05-07T20:26:34.1697012Z 2025-05-07T20:26:34.1697234Z ## Package Plan ## 2025-05-07T20:26:34.1697610Z 2025-05-07T20:26:34.1697837Z environment location: /__w/_temp/conda_environment_14891846312 2025-05-07T20:26:34.1705714Z 2025-05-07T20:26:34.1705826Z added / updated specs: 2025-05-07T20:26:34.1706079Z - expecttest 2025-05-07T20:26:34.1706281Z - pytest 2025-05-07T20:26:34.1706395Z 2025-05-07T20:26:34.1706408Z 2025-05-07T20:26:34.1706527Z The following packages will be downloaded: 2025-05-07T20:26:34.1706915Z 2025-05-07T20:26:34.1707033Z package | build 2025-05-07T20:26:34.1707346Z ---------------------------|----------------- 2025-05-07T20:26:34.1707717Z colorama-0.4.6 | pyhd8ed1ab_1 26 KB conda-forge 2025-05-07T20:26:34.1708145Z expecttest-0.3.0 | pyhd8ed1ab_0 14 KB conda-forge 2025-05-07T20:26:34.1708599Z iniconfig-2.0.0 | pyhd8ed1ab_1 11 KB conda-forge 2025-05-07T20:26:34.1709004Z pluggy-1.5.0 | pyhd8ed1ab_1 23 KB conda-forge 2025-05-07T20:26:34.1709408Z pytest-8.3.5 | pyhd8ed1ab_0 254 KB conda-forge 2025-05-07T20:26:34.1709835Z ------------------------------------------------------------ 2025-05-07T20:26:34.1710182Z Total: 329 KB 2025-05-07T20:26:34.1710469Z 2025-05-07T20:26:34.1710643Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:34.1710864Z 2025-05-07T20:26:34.1711061Z colorama conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1 2025-05-07T20:26:34.1711634Z expecttest conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0 2025-05-07T20:26:34.1712183Z iniconfig conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1 2025-05-07T20:26:34.1712628Z pluggy conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1 2025-05-07T20:26:34.1713086Z pytest conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0 2025-05-07T20:26:34.1713419Z 2025-05-07T20:26:34.1713423Z 2025-05-07T20:26:34.1713428Z 2025-05-07T20:26:34.1713535Z Downloading and Extracting Packages 2025-05-07T20:26:34.1713740Z 2025-05-07T20:26:34.1714419Z pluggy-1.5.0 | 23 KB | | 0% 2025-05-07T20:26:34.1714979Z 2025-05-07T20:26:34.1733863Z iniconfig-2.0.0 | 11 KB | | 0%  2025-05-07T20:26:34.1734126Z 2025-05-07T20:26:34.1734210Z 2025-05-07T20:26:34.1746937Z colorama-0.4.6 | 26 KB | | 0%  2025-05-07T20:26:34.1747230Z 2025-05-07T20:26:34.1747393Z 2025-05-07T20:26:34.1747435Z 2025-05-07T20:26:34.1765254Z pytest-8.3.5 | 254 KB | | 0%  2025-05-07T20:26:34.1765507Z 2025-05-07T20:26:34.1765511Z 2025-05-07T20:26:34.1765514Z 2025-05-07T20:26:34.1778895Z 2025-05-07T20:26:34.3238233Z expecttest-0.3.0 | 14 KB | | 0%  2025-05-07T20:26:34.3238514Z 2025-05-07T20:26:34.3296984Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:26:34.3297325Z 2025-05-07T20:26:34.3297332Z 2025-05-07T20:26:34.3401655Z colorama-0.4.6 | 26 KB | ###### | 61%  2025-05-07T20:26:34.3401924Z 2025-05-07T20:26:34.3401929Z 2025-05-07T20:26:34.3401933Z 2025-05-07T20:26:34.3401937Z 2025-05-07T20:26:34.3473796Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:26:34.3475908Z 2025-05-07T20:26:34.3578723Z iniconfig-2.0.0 | 11 KB | ########## | 100%  2025-05-07T20:26:34.3579331Z 2025-05-07T20:26:34.3579504Z 2025-05-07T20:26:34.3579519Z 2025-05-07T20:26:34.3579591Z 2025-05-07T20:26:34.3592302Z expecttest-0.3.0 | 14 KB | ########## | 100%  2025-05-07T20:26:34.3592592Z 2025-05-07T20:26:34.3592598Z 2025-05-07T20:26:34.3595032Z colorama-0.4.6 | 26 KB | ########## | 100%  2025-05-07T20:26:34.3708678Z pluggy-1.5.0 | 23 KB | ######9 | 69% 2025-05-07T20:26:34.3845790Z pluggy-1.5.0 | 23 KB | ########## | 100% 2025-05-07T20:26:34.3846032Z 2025-05-07T20:26:34.3846039Z 2025-05-07T20:26:34.3846045Z 2025-05-07T20:26:34.4172423Z pytest-8.3.5 | 254 KB | 6 | 6%  2025-05-07T20:26:34.4172912Z 2025-05-07T20:26:34.4172922Z 2025-05-07T20:26:34.4172931Z 2025-05-07T20:26:34.4215924Z pytest-8.3.5 | 254 KB | ########## | 100%  2025-05-07T20:26:34.4216448Z 2025-05-07T20:26:34.4216752Z 2025-05-07T20:26:34.4217275Z  2025-05-07T20:26:34.4217593Z 2025-05-07T20:26:34.4217600Z 2025-05-07T20:26:34.4217842Z  2025-05-07T20:26:34.4218146Z 2025-05-07T20:26:34.4218152Z 2025-05-07T20:26:34.4218158Z 2025-05-07T20:26:34.4218399Z  2025-05-07T20:26:34.4218722Z 2025-05-07T20:26:34.4218728Z 2025-05-07T20:26:34.4218734Z 2025-05-07T20:26:34.4218740Z 2025-05-07T20:26:34.4218985Z  2025-05-07T20:26:34.4307228Z Preparing transaction: ...working... done 2025-05-07T20:26:34.4422333Z Verifying transaction: ...working... done 2025-05-07T20:26:35.3938485Z Executing transaction: ...working... done 2025-05-07T20:26:36.8519351Z [TEST] Checking imports ... 2025-05-07T20:26:39.5515089Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:26:39.5517011Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:26:39.5517227Z 2025-05-07T20:26:39.5939484Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ... 2025-05-07T20:26:39.5954173Z [TEST] Setting feature flags ... 2025-05-07T20:26:39.5954835Z + conda env config vars set -p /__w/_temp/conda_environment_14891846312 FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:26:39.5955253Z 2025-05-07T20:26:40.0693535Z To make your changes take effect please reactivate your environment 2025-05-07T20:26:40.1110309Z 2025-05-07T20:26:40.1110817Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:26:40.1112038Z ################################################################################ 2025-05-07T20:26:40.1112357Z # Run FBGEMM-GPU Tests: 2025-05-07T20:26:40.1148205Z # 2025-05-07T20:26:40.1148650Z # [2025-05-07T20:26:40.114Z] + __run_fbgemm_gpu_tests_in_directory /__w/_temp/conda_environment_14891846312 2025-05-07T20:26:40.1149215Z ################################################################################ 2025-05-07T20:26:40.1149425Z 2025-05-07T20:26:40.1158432Z [TEST] Enumerating ALL test files ... 2025-05-07T20:26:40.1202874Z ./attention/gqa_test.py 2025-05-07T20:26:40.1203202Z ./coalesce/coalesce_test.py 2025-05-07T20:26:40.1203580Z ./comm/multi_gpu_car_test.py 2025-05-07T20:26:40.1203851Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:26:40.1204131Z ./kv_cache/kv_cache_test.py 2025-05-07T20:26:40.1204376Z ./moe/activation_test.py 2025-05-07T20:26:40.1204608Z ./moe/gather_scatter_test.py 2025-05-07T20:26:40.1204853Z ./moe/layers_test.py 2025-05-07T20:26:40.1205064Z ./moe/shuffling_test.py 2025-05-07T20:26:40.1205294Z ./quantize/quantize_test.py 2025-05-07T20:26:40.1205460Z 2025-05-07T20:26:40.1205571Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:26:40.1205785Z 2025-05-07T20:26:40.1229031Z ################################################################################ 2025-05-07T20:26:40.1261092Z # [2025-05-07T20:26:40.125Z] Run Python Test Suite: 2025-05-07T20:26:40.1261555Z # ./attention/gqa_test.py 2025-05-07T20:26:40.1261828Z ################################################################################ 2025-05-07T20:26:40.1303494Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:26:40.1304200Z 2025-05-07T20:26:41.7208279Z ============================= test session starts ============================== 2025-05-07T20:26:41.7209119Z platform linux -- Python 3.9.22, pytest-8.3.5, pluggy-1.5.0 -- /__w/_temp/conda_environment_14891846312/bin/python 2025-05-07T20:26:41.7209788Z cachedir: .pytest_cache 2025-05-07T20:26:41.7210810Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:26:41.7211477Z rootdir: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:26:41.7211808Z plugins: hypothesis-6.131.14 2025-05-07T20:26:43.1232849Z collecting ... collected 2 items 2025-05-07T20:26:43.1233079Z 2025-05-07T20:26:43.1246707Z attention/gqa_test.py::Int4GQATest::test_gqa SKIPPED (Skip when CUDA...) 2025-05-07T20:26:43.2467815Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 2025-05-07T20:26:43.2468276Z 2025-05-07T20:26:43.2468459Z =============================== warnings summary =============================== 2025-05-07T20:26:43.2469143Z ../../../../../../../../_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181 2025-05-07T20:26:43.2470927Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:26:43.2472758Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:26:43.2472973Z 2025-05-07T20:26:43.2473202Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:26:43.2473653Z =========================== short test summary info ============================ 2025-05-07T20:26:43.2474214Z SKIPPED [1] attention/gqa_test.py:146: Skip when CUDA is not available or CUDA compute capability is less than 8 2025-05-07T20:26:43.2475093Z SKIPPED [1] ../../../../../../../../_temp/conda_environment_14891846312/lib/python3.9/unittest/case.py:117: Skip when CUDA is not available or xformers is not available 2025-05-07T20:26:43.2475795Z ======================== 2 skipped, 1 warning in 2.03s ========================= 2025-05-07T20:26:43.6688131Z 2025-05-07T20:26:43.6688400Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:26:43.6723573Z [TEST] Python test time for ./attention/gqa_test.py: 3 seconds 2025-05-07T20:26:43.6723950Z 2025-05-07T20:26:43.6723957Z 2025-05-07T20:26:43.6723962Z 2025-05-07T20:26:43.6723967Z 2025-05-07T20:26:43.6751541Z ################################################################################ 2025-05-07T20:26:43.6782857Z # [2025-05-07T20:26:43.677Z] Run Python Test Suite: 2025-05-07T20:26:43.6783268Z # ./coalesce/coalesce_test.py 2025-05-07T20:26:43.6783606Z ################################################################################ 2025-05-07T20:26:43.6830864Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:26:43.6831640Z 2025-05-07T20:26:44.8852341Z ============================= test session starts ============================== 2025-05-07T20:26:44.8853192Z platform linux -- Python 3.9.22, pytest-8.3.5, pluggy-1.5.0 -- /__w/_temp/conda_environment_14891846312/bin/python 2025-05-07T20:26:44.8853812Z cachedir: .pytest_cache 2025-05-07T20:26:44.8854383Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:26:44.8855062Z rootdir: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:26:44.8855397Z plugins: hypothesis-6.131.14 2025-05-07T20:26:46.3344452Z collecting ... collected 1 item 2025-05-07T20:26:46.3344646Z 2025-05-07T20:27:01.2669591Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches FAILED 2025-05-07T20:27:01.2670096Z 2025-05-07T20:27:01.2670250Z =================================== FAILURES =================================== 2025-05-07T20:27:01.2670795Z ______________________ CoalesceTest.test_coalesce_batches ______________________ 2025-05-07T20:27:01.2671449Z 2025-05-07T20:27:01.2671653Z self = 2025-05-07T20:27:01.2672062Z 2025-05-07T20:27:01.2672136Z @given( 2025-05-07T20:27:01.2672450Z > device=st.sampled_from([torch.device("cpu"), torch.device("cuda")]), 2025-05-07T20:27:01.2672893Z batch_size=st.integers(min_value=10, max_value=5000), 2025-05-07T20:27:01.2673287Z num_inputs=st.integers(min_value=1, max_value=50), 2025-05-07T20:27:01.2673597Z ) 2025-05-07T20:27:01.2673702Z 2025-05-07T20:27:01.2673803Z coalesce/coalesce_test.py:22: 2025-05-07T20:27:01.2674102Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:01.2674476Z coalesce/coalesce_test.py:37: in test_coalesce_batches 2025-05-07T20:27:01.2674843Z new_bids = torch.tensor(new_bids).to(device) 2025-05-07T20:27:01.2675169Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:01.2675396Z 2025-05-07T20:27:01.2675480Z def _lazy_init(): 2025-05-07T20:27:01.2675718Z global _initialized, _queued_calls 2025-05-07T20:27:01.2676171Z if is_initialized() or hasattr(_tls, "is_initializing"): 2025-05-07T20:27:01.2676496Z return 2025-05-07T20:27:01.2676714Z with _initialization_lock: 2025-05-07T20:27:01.2677050Z # We be double-checked locking, boys! This is OK because 2025-05-07T20:27:01.2677458Z # the above test was GIL protected anyway. The inner test 2025-05-07T20:27:01.2677876Z # is for when a thread blocked on some other thread which was 2025-05-07T20:27:01.2678301Z # doing the initialization; when they get the lock, they will 2025-05-07T20:27:01.2678686Z # find there is nothing left to do. 2025-05-07T20:27:01.2678981Z if is_initialized(): 2025-05-07T20:27:01.2679220Z return 2025-05-07T20:27:01.2679541Z # It is important to prevent other threads from entering _lazy_init 2025-05-07T20:27:01.2680034Z # immediately, while we are still guaranteed to have the GIL, because some 2025-05-07T20:27:01.2680489Z # of the C calls we make below will release the GIL 2025-05-07T20:27:01.2680807Z if _is_in_bad_fork(): 2025-05-07T20:27:01.2681070Z raise RuntimeError( 2025-05-07T20:27:01.2681625Z "Cannot re-initialize CUDA in forked subprocess. To use CUDA with " 2025-05-07T20:27:01.2682099Z "multiprocessing, you must use the 'spawn' start method" 2025-05-07T20:27:01.2682443Z ) 2025-05-07T20:27:01.2682695Z if not hasattr(torch._C, "_cuda_getDeviceCount"): 2025-05-07T20:27:01.2683110Z raise AssertionError("Torch not compiled with CUDA enabled") 2025-05-07T20:27:01.2683470Z if _cudart is None: 2025-05-07T20:27:01.2683733Z raise AssertionError( 2025-05-07T20:27:01.2684134Z "libcudart functions unavailable. It looks like you have a broken build?" 2025-05-07T20:27:01.2684535Z ) 2025-05-07T20:27:01.2685048Z # This function throws if there's a driver initialization error, no GPUs 2025-05-07T20:27:01.2685462Z # are found or any other error occurs 2025-05-07T20:27:01.2685788Z if "CUDA_MODULE_LOADING" not in os.environ: 2025-05-07T20:27:01.2686117Z os.environ["CUDA_MODULE_LOADING"] = "LAZY" 2025-05-07T20:27:01.2686428Z > torch._C._cuda_init() 2025-05-07T20:27:01.2687251Z E RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library 2025-05-07T20:27:01.2688126Z E Falsifying example: test_coalesce_batches( 2025-05-07T20:27:01.2688539Z E # The test always failed when commented parts were varied together. 2025-05-07T20:27:01.2689027Z E self=, 2025-05-07T20:27:01.2689510Z E device=device(type='cuda'), 2025-05-07T20:27:01.2689833Z E batch_size=10, # or any other generated value 2025-05-07T20:27:01.2690192Z E num_inputs=1, # or any other generated value 2025-05-07T20:27:01.2690484Z E ) 2025-05-07T20:27:01.2690683Z E Explanation: 2025-05-07T20:27:01.2691001Z E These lines were always and only run by failing examples: 2025-05-07T20:27:01.2691557Z E /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:354 2025-05-07T20:27:01.2692023Z E 2025-05-07T20:27:01.2692539Z E You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQpBAQ==') as a decorator on your test case 2025-05-07T20:27:01.2693046Z 2025-05-07T20:27:01.2693379Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:382: RuntimeError 2025-05-07T20:27:01.2693943Z =============================== warnings summary =============================== 2025-05-07T20:27:01.2694536Z ../../../../../../../../_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181 2025-05-07T20:27:01.2696198Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:27:01.2697689Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:27:01.2697902Z 2025-05-07T20:27:01.2698125Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:27:01.2698579Z ======================== 1 failed, 1 warning in 16.51s ========================= 2025-05-07T20:27:01.7102801Z ERROR conda.cli.main_run:execute(47): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py` failed. (See above for error) 2025-05-07T20:27:01.7532184Z 2025-05-07T20:27:01.7532477Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./coalesce/coalesce_test.py 2025-05-07T20:27:01.7532843Z 2025-05-07T20:27:01.7532847Z 2025-05-07T20:27:01.7571126Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./coalesce/coalesce_test.py 2025-05-07T20:27:02.9640163Z ============================= test session starts ============================== 2025-05-07T20:27:02.9640924Z platform linux -- Python 3.9.22, pytest-8.3.5, pluggy-1.5.0 -- /__w/_temp/conda_environment_14891846312/bin/python 2025-05-07T20:27:02.9641420Z cachedir: .pytest_cache 2025-05-07T20:27:02.9641992Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:27:02.9642688Z rootdir: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:27:02.9643032Z plugins: hypothesis-6.131.14 2025-05-07T20:27:04.4152328Z collecting ... collected 1 item 2025-05-07T20:27:04.4152785Z run-last-failure: rerun previous 1 failure 2025-05-07T20:27:04.4153039Z 2025-05-07T20:27:19.3573590Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches FAILED 2025-05-07T20:27:19.3574032Z 2025-05-07T20:27:19.3574154Z =================================== FAILURES =================================== 2025-05-07T20:27:19.3574574Z ______________________ CoalesceTest.test_coalesce_batches ______________________ 2025-05-07T20:27:19.3574863Z 2025-05-07T20:27:19.3575057Z self = 2025-05-07T20:27:19.3575357Z 2025-05-07T20:27:19.3575428Z @given( 2025-05-07T20:27:19.3575735Z > device=st.sampled_from([torch.device("cpu"), torch.device("cuda")]), 2025-05-07T20:27:19.3576369Z batch_size=st.integers(min_value=10, max_value=5000), 2025-05-07T20:27:19.3576850Z num_inputs=st.integers(min_value=1, max_value=50), 2025-05-07T20:27:19.3577257Z ) 2025-05-07T20:27:19.3577406Z 2025-05-07T20:27:19.3577539Z coalesce/coalesce_test.py:22: 2025-05-07T20:27:19.3577930Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:19.3578351Z coalesce/coalesce_test.py:37: in test_coalesce_batches 2025-05-07T20:27:19.3578726Z new_bids = torch.tensor(new_bids).to(device) 2025-05-07T20:27:19.3579053Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:19.3579275Z 2025-05-07T20:27:19.3579360Z def _lazy_init(): 2025-05-07T20:27:19.3579598Z global _initialized, _queued_calls 2025-05-07T20:27:19.3579950Z if is_initialized() or hasattr(_tls, "is_initializing"): 2025-05-07T20:27:19.3580263Z return 2025-05-07T20:27:19.3580494Z with _initialization_lock: 2025-05-07T20:27:19.3580819Z # We be double-checked locking, boys! This is OK because 2025-05-07T20:27:19.3581354Z # the above test was GIL protected anyway. The inner test 2025-05-07T20:27:19.3581770Z # is for when a thread blocked on some other thread which was 2025-05-07T20:27:19.3582203Z # doing the initialization; when they get the lock, they will 2025-05-07T20:27:19.3582579Z # find there is nothing left to do. 2025-05-07T20:27:19.3582865Z if is_initialized(): 2025-05-07T20:27:19.3583106Z return 2025-05-07T20:27:19.3583416Z # It is important to prevent other threads from entering _lazy_init 2025-05-07T20:27:19.3583912Z # immediately, while we are still guaranteed to have the GIL, because some 2025-05-07T20:27:19.3584367Z # of the C calls we make below will release the GIL 2025-05-07T20:27:19.3584684Z if _is_in_bad_fork(): 2025-05-07T20:27:19.3585250Z raise RuntimeError( 2025-05-07T20:27:19.3585636Z "Cannot re-initialize CUDA in forked subprocess. To use CUDA with " 2025-05-07T20:27:19.3586109Z "multiprocessing, you must use the 'spawn' start method" 2025-05-07T20:27:19.3586442Z ) 2025-05-07T20:27:19.3586701Z if not hasattr(torch._C, "_cuda_getDeviceCount"): 2025-05-07T20:27:19.3587273Z raise AssertionError("Torch not compiled with CUDA enabled") 2025-05-07T20:27:19.3587650Z if _cudart is None: 2025-05-07T20:27:19.3587910Z raise AssertionError( 2025-05-07T20:27:19.3588313Z "libcudart functions unavailable. It looks like you have a broken build?" 2025-05-07T20:27:19.3588712Z ) 2025-05-07T20:27:19.3589033Z # This function throws if there's a driver initialization error, no GPUs 2025-05-07T20:27:19.3589450Z # are found or any other error occurs 2025-05-07T20:27:19.3589771Z if "CUDA_MODULE_LOADING" not in os.environ: 2025-05-07T20:27:19.3590113Z os.environ["CUDA_MODULE_LOADING"] = "LAZY" 2025-05-07T20:27:19.3590425Z > torch._C._cuda_init() 2025-05-07T20:27:19.3591269Z E RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library 2025-05-07T20:27:19.3592320Z E Falsifying example: test_coalesce_batches( 2025-05-07T20:27:19.3592729Z E # The test always failed when commented parts were varied together. 2025-05-07T20:27:19.3593227Z E self=, 2025-05-07T20:27:19.3593631Z E device=device(type='cuda'), 2025-05-07T20:27:19.3593958Z E batch_size=10, # or any other generated value 2025-05-07T20:27:19.3594314Z E num_inputs=1, # or any other generated value 2025-05-07T20:27:19.3594681Z E ) 2025-05-07T20:27:19.3594881Z E Explanation: 2025-05-07T20:27:19.3595184Z E These lines were always and only run by failing examples: 2025-05-07T20:27:19.3595748Z E /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:354 2025-05-07T20:27:19.3596206Z E 2025-05-07T20:27:19.3596736Z E You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQpBAQ==') as a decorator on your test case 2025-05-07T20:27:19.3597244Z 2025-05-07T20:27:19.3597583Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:382: RuntimeError 2025-05-07T20:27:19.3598140Z =============================== warnings summary =============================== 2025-05-07T20:27:19.3598669Z ../../../../../../../../_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181 2025-05-07T20:27:19.3600347Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:27:19.3601930Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:27:19.3602149Z 2025-05-07T20:27:19.3602367Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:27:19.3602813Z ======================== 1 failed, 1 warning in 16.53s ========================= 2025-05-07T20:27:19.7977352Z ERROR conda.cli.main_run:execute(47): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./coalesce/coalesce_test.py` failed. (See above for error) 2025-05-07T20:27:19.8406542Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:27:19.8406858Z 2025-05-07T20:27:21.8449924Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./coalesce/coalesce_test.py 2025-05-07T20:27:23.0482300Z ============================= test session starts ============================== 2025-05-07T20:27:23.0483438Z platform linux -- Python 3.9.22, pytest-8.3.5, pluggy-1.5.0 -- /__w/_temp/conda_environment_14891846312/bin/python 2025-05-07T20:27:23.0483946Z cachedir: .pytest_cache 2025-05-07T20:27:23.0484511Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:27:23.0485483Z rootdir: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:27:23.0485824Z plugins: hypothesis-6.131.14 2025-05-07T20:27:24.4900698Z collecting ... collected 1 item 2025-05-07T20:27:24.4901150Z run-last-failure: rerun previous 1 failure 2025-05-07T20:27:24.4901470Z 2025-05-07T20:27:39.3771733Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches FAILED 2025-05-07T20:27:39.3772264Z 2025-05-07T20:27:39.3772390Z =================================== FAILURES =================================== 2025-05-07T20:27:39.3772798Z ______________________ CoalesceTest.test_coalesce_batches ______________________ 2025-05-07T20:27:39.3773076Z 2025-05-07T20:27:39.3773284Z self = 2025-05-07T20:27:39.3773580Z 2025-05-07T20:27:39.3773656Z @given( 2025-05-07T20:27:39.3774057Z > device=st.sampled_from([torch.device("cpu"), torch.device("cuda")]), 2025-05-07T20:27:39.3774587Z batch_size=st.integers(min_value=10, max_value=5000), 2025-05-07T20:27:39.3774968Z num_inputs=st.integers(min_value=1, max_value=50), 2025-05-07T20:27:39.3775343Z ) 2025-05-07T20:27:39.3775486Z 2025-05-07T20:27:39.3775638Z coalesce/coalesce_test.py:22: 2025-05-07T20:27:39.3776232Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:39.3776753Z coalesce/coalesce_test.py:37: in test_coalesce_batches 2025-05-07T20:27:39.3777236Z new_bids = torch.tensor(new_bids).to(device) 2025-05-07T20:27:39.3777582Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:39.3777805Z 2025-05-07T20:27:39.3777895Z def _lazy_init(): 2025-05-07T20:27:39.3778135Z global _initialized, _queued_calls 2025-05-07T20:27:39.3778483Z if is_initialized() or hasattr(_tls, "is_initializing"): 2025-05-07T20:27:39.3778828Z return 2025-05-07T20:27:39.3779049Z with _initialization_lock: 2025-05-07T20:27:39.3779373Z # We be double-checked locking, boys! This is OK because 2025-05-07T20:27:39.3779780Z # the above test was GIL protected anyway. The inner test 2025-05-07T20:27:39.3780197Z # is for when a thread blocked on some other thread which was 2025-05-07T20:27:39.3780622Z # doing the initialization; when they get the lock, they will 2025-05-07T20:27:39.3781103Z # find there is nothing left to do. 2025-05-07T20:27:39.3781390Z if is_initialized(): 2025-05-07T20:27:39.3781634Z return 2025-05-07T20:27:39.3781995Z # It is important to prevent other threads from entering _lazy_init 2025-05-07T20:27:39.3782496Z # immediately, while we are still guaranteed to have the GIL, because some 2025-05-07T20:27:39.3782947Z # of the C calls we make below will release the GIL 2025-05-07T20:27:39.3783260Z if _is_in_bad_fork(): 2025-05-07T20:27:39.3783520Z raise RuntimeError( 2025-05-07T20:27:39.3783895Z "Cannot re-initialize CUDA in forked subprocess. To use CUDA with " 2025-05-07T20:27:39.3784370Z "multiprocessing, you must use the 'spawn' start method" 2025-05-07T20:27:39.3784703Z ) 2025-05-07T20:27:39.3785268Z if not hasattr(torch._C, "_cuda_getDeviceCount"): 2025-05-07T20:27:39.3785687Z raise AssertionError("Torch not compiled with CUDA enabled") 2025-05-07T20:27:39.3786048Z if _cudart is None: 2025-05-07T20:27:39.3786309Z raise AssertionError( 2025-05-07T20:27:39.3786710Z "libcudart functions unavailable. It looks like you have a broken build?" 2025-05-07T20:27:39.3787279Z ) 2025-05-07T20:27:39.3787608Z # This function throws if there's a driver initialization error, no GPUs 2025-05-07T20:27:39.3788024Z # are found or any other error occurs 2025-05-07T20:27:39.3788350Z if "CUDA_MODULE_LOADING" not in os.environ: 2025-05-07T20:27:39.3788682Z os.environ["CUDA_MODULE_LOADING"] = "LAZY" 2025-05-07T20:27:39.3788988Z > torch._C._cuda_init() 2025-05-07T20:27:39.3789807Z E RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library 2025-05-07T20:27:39.3790693Z E Falsifying example: test_coalesce_batches( 2025-05-07T20:27:39.3791125Z E # The test always failed when commented parts were varied together. 2025-05-07T20:27:39.3791734Z E self=, 2025-05-07T20:27:39.3792144Z E device=device(type='cuda'), 2025-05-07T20:27:39.3792466Z E batch_size=10, # or any other generated value 2025-05-07T20:27:39.3792822Z E num_inputs=1, # or any other generated value 2025-05-07T20:27:39.3793116Z E ) 2025-05-07T20:27:39.3793318Z E Explanation: 2025-05-07T20:27:39.3793622Z E These lines were always and only run by failing examples: 2025-05-07T20:27:39.3794182Z E /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:354 2025-05-07T20:27:39.3794724Z E 2025-05-07T20:27:39.3795244Z E You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQpBAQ==') as a decorator on your test case 2025-05-07T20:27:39.3795752Z 2025-05-07T20:27:39.3796084Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:382: RuntimeError 2025-05-07T20:27:39.3796641Z =============================== warnings summary =============================== 2025-05-07T20:27:39.3797169Z ../../../../../../../../_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181 2025-05-07T20:27:39.3798829Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:27:39.3800297Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:27:39.3800587Z 2025-05-07T20:27:39.3800804Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:27:39.3801250Z ======================== 1 failed, 1 warning in 16.46s ========================= 2025-05-07T20:27:39.8224053Z ERROR conda.cli.main_run:execute(47): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./coalesce/coalesce_test.py` failed. (See above for error) 2025-05-07T20:27:39.8645112Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:27:39.8645342Z 2025-05-07T20:27:41.8680351Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./coalesce/coalesce_test.py 2025-05-07T20:27:43.0696744Z ============================= test session starts ============================== 2025-05-07T20:27:43.0697408Z platform linux -- Python 3.9.22, pytest-8.3.5, pluggy-1.5.0 -- /__w/_temp/conda_environment_14891846312/bin/python 2025-05-07T20:27:43.0697900Z cachedir: .pytest_cache 2025-05-07T20:27:43.0698472Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:27:43.0699483Z rootdir: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/fbgemm_gpu 2025-05-07T20:27:43.0699833Z plugins: hypothesis-6.131.14 2025-05-07T20:27:44.5186448Z collecting ... collected 1 item 2025-05-07T20:27:44.5186812Z run-last-failure: rerun previous 1 failure 2025-05-07T20:27:44.5187029Z 2025-05-07T20:27:59.1893685Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches FAILED 2025-05-07T20:27:59.1901633Z 2025-05-07T20:27:59.1901841Z =================================== FAILURES =================================== 2025-05-07T20:27:59.1902267Z ______________________ CoalesceTest.test_coalesce_batches ______________________ 2025-05-07T20:27:59.1902585Z 2025-05-07T20:27:59.1902794Z self = 2025-05-07T20:27:59.1903087Z 2025-05-07T20:27:59.1903168Z @given( 2025-05-07T20:27:59.1903475Z > device=st.sampled_from([torch.device("cpu"), torch.device("cuda")]), 2025-05-07T20:27:59.1903942Z batch_size=st.integers(min_value=10, max_value=5000), 2025-05-07T20:27:59.1904317Z num_inputs=st.integers(min_value=1, max_value=50), 2025-05-07T20:27:59.1904622Z ) 2025-05-07T20:27:59.1904728Z 2025-05-07T20:27:59.1904831Z coalesce/coalesce_test.py:22: 2025-05-07T20:27:59.1905133Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:59.1905513Z coalesce/coalesce_test.py:37: in test_coalesce_batches 2025-05-07T20:27:59.1905875Z new_bids = torch.tensor(new_bids).to(device) 2025-05-07T20:27:59.1906220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:27:59.1906636Z 2025-05-07T20:27:59.1906718Z def _lazy_init(): 2025-05-07T20:27:59.1906975Z global _initialized, _queued_calls 2025-05-07T20:27:59.1907322Z if is_initialized() or hasattr(_tls, "is_initializing"): 2025-05-07T20:27:59.1907648Z return 2025-05-07T20:27:59.1907875Z with _initialization_lock: 2025-05-07T20:27:59.1908205Z # We be double-checked locking, boys! This is OK because 2025-05-07T20:27:59.1908670Z # the above test was GIL protected anyway. The inner test 2025-05-07T20:27:59.1909079Z # is for when a thread blocked on some other thread which was 2025-05-07T20:27:59.1909510Z # doing the initialization; when they get the lock, they will 2025-05-07T20:27:59.1909879Z # find there is nothing left to do. 2025-05-07T20:27:59.1910176Z if is_initialized(): 2025-05-07T20:27:59.1910429Z return 2025-05-07T20:27:59.1910742Z # It is important to prevent other threads from entering _lazy_init 2025-05-07T20:27:59.1911345Z # immediately, while we are still guaranteed to have the GIL, because some 2025-05-07T20:27:59.1911792Z # of the C calls we make below will release the GIL 2025-05-07T20:27:59.1912249Z if _is_in_bad_fork(): 2025-05-07T20:27:59.1912504Z raise RuntimeError( 2025-05-07T20:27:59.1912894Z "Cannot re-initialize CUDA in forked subprocess. To use CUDA with " 2025-05-07T20:27:59.1913359Z "multiprocessing, you must use the 'spawn' start method" 2025-05-07T20:27:59.1913701Z ) 2025-05-07T20:27:59.1913962Z if not hasattr(torch._C, "_cuda_getDeviceCount"): 2025-05-07T20:27:59.1914377Z raise AssertionError("Torch not compiled with CUDA enabled") 2025-05-07T20:27:59.1914747Z if _cudart is None: 2025-05-07T20:27:59.1914995Z raise AssertionError( 2025-05-07T20:27:59.1915404Z "libcudart functions unavailable. It looks like you have a broken build?" 2025-05-07T20:27:59.1915801Z ) 2025-05-07T20:27:59.1916126Z # This function throws if there's a driver initialization error, no GPUs 2025-05-07T20:27:59.1916539Z # are found or any other error occurs 2025-05-07T20:27:59.1916857Z if "CUDA_MODULE_LOADING" not in os.environ: 2025-05-07T20:27:59.1917358Z os.environ["CUDA_MODULE_LOADING"] = "LAZY" 2025-05-07T20:27:59.1917663Z > torch._C._cuda_init() 2025-05-07T20:27:59.1918492Z E RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library 2025-05-07T20:27:59.1919360Z E Falsifying example: test_coalesce_batches( 2025-05-07T20:27:59.1919768Z E # The test always failed when commented parts were varied together. 2025-05-07T20:27:59.1920268Z E self=, 2025-05-07T20:27:59.1920671Z E device=device(type='cuda'), 2025-05-07T20:27:59.1920995Z E batch_size=10, # or any other generated value 2025-05-07T20:27:59.1921346Z E num_inputs=1, # or any other generated value 2025-05-07T20:27:59.1921646Z E ) 2025-05-07T20:27:59.1921845Z E Explanation: 2025-05-07T20:27:59.1922160Z E These lines were always and only run by failing examples: 2025-05-07T20:27:59.1922721Z E /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:354 2025-05-07T20:27:59.1923179Z E 2025-05-07T20:27:59.1923702Z E You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQpBAQ==') as a decorator on your test case 2025-05-07T20:27:59.1924200Z 2025-05-07T20:27:59.1924530Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:382: RuntimeError 2025-05-07T20:27:59.1925150Z =============================== warnings summary =============================== 2025-05-07T20:27:59.1925676Z ../../../../../../../../_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181 2025-05-07T20:27:59.1927331Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:27:59.1928813Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:27:59.1929024Z 2025-05-07T20:27:59.1929250Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:27:59.1929692Z ======================== 1 failed, 1 warning in 16.25s ========================= 2025-05-07T20:27:59.6290955Z ERROR conda.cli.main_run:execute(47): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./coalesce/coalesce_test.py` failed. (See above for error) 2025-05-07T20:27:59.6717075Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:27:59.6717604Z 2025-05-07T20:27:59.6718067Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:27:59.6718655Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./coalesce/coalesce_test.py 2025-05-07T20:27:59.6719077Z 2025-05-07T20:27:59.6719081Z 2025-05-07T20:27:59.6719085Z 2025-05-07T20:27:59.6766117Z [NOVA] Time taken to test all unit tests: 119 seconds / 00:01:59 2025-05-07T20:27:59.8507425Z ##[group]Run set -euxo pipefail 2025-05-07T20:27:59.8507819Z set -euxo pipefail 2025-05-07T20:27:59.8508087Z source "${BUILD_ENV_FILE}" 2025-05-07T20:27:59.8508404Z WHEEL_NAME=$(ls "pytorch/FBGEMM/dist/") 2025-05-07T20:27:59.8508730Z echo "$WHEEL_NAME" 2025-05-07T20:27:59.8508998Z  2025-05-07T20:27:59.8509300Z ${CONDA_RUN} pip install "pytorch/FBGEMM/dist/$WHEEL_NAME" 2025-05-07T20:27:59.8509790Z # Checking that we have a pinned version of torch in our dependency tree 2025-05-07T20:27:59.8510177Z ( 2025-05-07T20:27:59.8510418Z  pushd "${RUNNER_TEMP}" 2025-05-07T20:27:59.8510864Z  unzip -o "${GITHUB_WORKSPACE}/pytorch/FBGEMM/dist/$WHEEL_NAME" 2025-05-07T20:27:59.8511384Z  # Ensure that pytorch version is pinned, should output file where it was found 2025-05-07T20:27:59.8511978Z  grep "Requires-Dist: torch (==.*)" -r . 2025-05-07T20:27:59.8512314Z ) 2025-05-07T20:27:59.8512494Z  2025-05-07T20:27:59.8512769Z if [[ (! -f "pytorch/FBGEMM"/${SMOKE_TEST_SCRIPT}) ]]; then 2025-05-07T20:27:59.8513185Z  echo "pytorch/FBGEMM/${SMOKE_TEST_SCRIPT} not found" 2025-05-07T20:27:59.8513573Z  if [[ "${PACKAGE_NAME}" = "torchrec" ]]; then 2025-05-07T20:27:59.8514031Z  # Special case for torchrec temporarily since __version__ does not 2025-05-07T20:27:59.8514495Z  # work correctly on main in torchrec. This block will be 2025-05-07T20:27:59.8514860Z  # removed once we fix it. 2025-05-07T20:27:59.8515193Z  ${CONDA_RUN} python -c "import ${PACKAGE_NAME}" 2025-05-07T20:27:59.8515509Z  else 2025-05-07T20:27:59.8515955Z  ${CONDA_RUN} python -c "import ${PACKAGE_NAME}; print('package version is ', ${PACKAGE_NAME}.__version__)" 2025-05-07T20:27:59.8516442Z  fi 2025-05-07T20:27:59.8516637Z else 2025-05-07T20:27:59.8516905Z  echo "pytorch/FBGEMM/${SMOKE_TEST_SCRIPT} found" 2025-05-07T20:27:59.8517328Z  ${CONDA_RUN} python "pytorch/FBGEMM/${SMOKE_TEST_SCRIPT}" 2025-05-07T20:27:59.8517664Z fi 2025-05-07T20:27:59.8517907Z shell: bash -l {0} 2025-05-07T20:27:59.8518220Z env: 2025-05-07T20:27:59.8518408Z PYTHON_VERSION: 3.9 2025-05-07T20:27:59.8518641Z PACKAGE_TYPE: wheel 2025-05-07T20:27:59.8518874Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:27:59.8519113Z REF: 2025-05-07T20:27:59.8519288Z CU_VERSION: cu128 2025-05-07T20:27:59.8519510Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:27:59.8519738Z ARCH: x86_64 2025-05-07T20:27:59.8519941Z BUILD_TARGET: genai 2025-05-07T20:27:59.8520150Z CHANNEL: nightly 2025-05-07T20:27:59.8520401Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:27:59.8520737Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:27:59.8521088Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:27:59.8521490Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:27:59.8521844Z PACKAGE_NAME: fbgemm_gpu 2025-05-07T20:27:59.8522081Z SMOKE_TEST_SCRIPT: 2025-05-07T20:27:59.8522287Z ##[endgroup] 2025-05-07T20:28:00.1092546Z + source /__w/_temp/build_env_14891846312 2025-05-07T20:28:00.1093036Z ++ export BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:28:00.1095523Z ++ BUILD_VERSION=0.1.0.dev20250507+cu128 2025-05-07T20:28:00.1095818Z ++ export CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:28:00.1096099Z ++ CUDA_HOME=/usr/local/cuda-12.8 2025-05-07T20:28:00.1096378Z ++ export CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:28:00.1096671Z ++ CUDA_PATH=/usr/local/cuda-12.8 2025-05-07T20:28:00.1096924Z ++ export FORCE_CUDA=1 2025-05-07T20:28:00.1097147Z ++ FORCE_CUDA=1 2025-05-07T20:28:00.1097932Z ++ export PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:28:00.1099767Z ++ PATH=/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:28:00.1101308Z ++ export PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:28:00.1103118Z ++ PATH=/usr/local/cuda-12.8/bin:/opt/python/cp39-cp39/bin:/usr/share/Modules/bin:/usr/local/cuda/bin:/opt/rh/gcc-toolset-11/root/usr/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-05-07T20:28:00.1104517Z ++ export 'PIP_INSTALL_TORCH=pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:28:00.1105358Z ++ PIP_INSTALL_TORCH='pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cu128' 2025-05-07T20:28:00.1106033Z ++ export PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:28:00.1106524Z ++ PYTORCH_S3_BUCKET_PATH=s3://pytorch/whl/nightly/cu128/ 2025-05-07T20:28:00.1106869Z ++ export PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:28:00.1107171Z ++ PYTORCH_VERSION_SUFFIX=+cu128 2025-05-07T20:28:00.1107527Z ++ export 'TORCH_CUDA_ARCH_LIST=5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:28:00.1107972Z ++ TORCH_CUDA_ARCH_LIST='5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:28:00.1108317Z ++ export VERSION_SUFFIX=+cu128 2025-05-07T20:28:00.1108572Z ++ VERSION_SUFFIX=+cu128 2025-05-07T20:28:00.1108811Z ++ export WHEEL_DIR=cu128/ 2025-05-07T20:28:00.1109037Z ++ WHEEL_DIR=cu128/ 2025-05-07T20:28:00.1109273Z ++ FBGEMM_DIR=/__w/FBGEMM/FBGEMM 2025-05-07T20:28:00.1109657Z ++ export FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:28:00.1110018Z ++ FBGEMM_REPO=/__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:28:00.1110358Z +++ pwd 2025-05-07T20:28:00.1110578Z ++ working_dir=/__w/FBGEMM/FBGEMM 2025-05-07T20:28:00.1111022Z ++ [[ /__w/FBGEMM/FBGEMM == \/\_\_\w\/\F\B\G\E\M\M\/\F\B\G\E\M\M\/\p\y\t\o\r\c\h\/\F\B\G\E\M\M ]] 2025-05-07T20:28:00.1111418Z ++ export BUILD_FROM_NOVA=1 2025-05-07T20:28:00.1111657Z ++ BUILD_FROM_NOVA=1 2025-05-07T20:28:00.1111981Z ++ [[ cu128 == \c\u* ]] 2025-05-07T20:28:00.1112463Z ++ echo 'Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0' 2025-05-07T20:28:00.1112903Z ++ [[ /__w/_temp/conda_environment_14891846312 != '' ]] 2025-05-07T20:28:00.1113392Z ++ export 'CONDA_RUN=conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:28:00.1114006Z ++ CONDA_RUN='conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:28:00.1114575Z ++ echo 'conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312' 2025-05-07T20:28:00.1114990Z ++ [[ cu128 == \c\u\1\2\8 ]] 2025-05-07T20:28:00.1115287Z ++ export 'TORCH_CUDA_ARCH_LIST=7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:28:00.1115662Z ++ TORCH_CUDA_ARCH_LIST='7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:28:00.1116037Z ++ echo 'Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a' 2025-05-07T20:28:00.1116496Z Current TORCH_CUDA_ARCH_LIST value: 5.0+PTX;6.0;7.0;7.5;8.0;8.6;9.0;10.0;12.0 2025-05-07T20:28:00.1117000Z conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:28:00.1117491Z Set TORCH_CUDA_ARCH_LIST to: 7.0;8.0;9.0;9.0a;10.0a;12.0a 2025-05-07T20:28:00.1117823Z ++ ls pytorch/FBGEMM/dist/ 2025-05-07T20:28:00.1136055Z + WHEEL_NAME=fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:00.1136743Z fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:00.1137267Z + echo fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:00.1138129Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 pip install pytorch/FBGEMM/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:01.0342309Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:28:01.0343100Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:28:01.3518958Z Processing ./pytorch/FBGEMM/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:01.3713565Z Requirement already satisfied: numpy in /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages (from fbgemm-gpu-genai==2025.5.7+cu128) (2.0.2) 2025-05-07T20:28:01.3730549Z fbgemm-gpu-genai is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel. 2025-05-07T20:28:01.5045632Z WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning. 2025-05-07T20:28:01.5969038Z + pushd /__w/_temp 2025-05-07T20:28:01.5969603Z + unzip -o /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:01.5970174Z /__w/_temp /__w/FBGEMM/FBGEMM 2025-05-07T20:28:01.6017100Z Archive: /__w/FBGEMM/FBGEMM/pytorch/FBGEMM/dist/fbgemm_gpu_genai-2025.5.7+cu128-cp39-cp39-manylinux_2_28_x86_64.whl 2025-05-07T20:28:01.6017774Z inflating: fbgemm_gpu/__init__.py 2025-05-07T20:28:01.6063039Z inflating: fbgemm_gpu/asmjit.so 2025-05-07T20:28:01.6063418Z inflating: fbgemm_gpu/batched_unary_embeddings_ops.py 2025-05-07T20:28:01.6065087Z inflating: fbgemm_gpu/enums.py 2025-05-07T20:28:01.6379619Z inflating: fbgemm_gpu/fbgemm.so 2025-05-07T20:28:01.6380181Z inflating: fbgemm_gpu/metrics.py 2025-05-07T20:28:01.6380970Z inflating: fbgemm_gpu/permute_pooled_embedding_modules.py 2025-05-07T20:28:01.6382516Z inflating: fbgemm_gpu/permute_pooled_embedding_modules_split.py 2025-05-07T20:28:01.6384999Z inflating: fbgemm_gpu/quantize_comm.py 2025-05-07T20:28:01.6386671Z inflating: fbgemm_gpu/quantize_utils.py 2025-05-07T20:28:01.6388309Z inflating: fbgemm_gpu/runtime_monitor.py 2025-05-07T20:28:01.6392042Z inflating: fbgemm_gpu/sparse_ops.py 2025-05-07T20:28:01.6393325Z inflating: fbgemm_gpu/split_embedding_configs.py 2025-05-07T20:28:01.6395262Z inflating: fbgemm_gpu/split_embedding_inference_converter.py 2025-05-07T20:28:01.6395948Z inflating: fbgemm_gpu/split_embedding_optimizer_ops.py 2025-05-07T20:28:01.6396925Z inflating: fbgemm_gpu/split_embedding_utils.py 2025-05-07T20:28:01.6398755Z inflating: fbgemm_gpu/split_table_batched_embeddings_ops.py 2025-05-07T20:28:01.6399957Z inflating: fbgemm_gpu/split_table_batched_embeddings_ops_common.py 2025-05-07T20:28:01.6406195Z inflating: fbgemm_gpu/split_table_batched_embeddings_ops_inference.py 2025-05-07T20:28:01.6416975Z inflating: fbgemm_gpu/split_table_batched_embeddings_ops_training.py 2025-05-07T20:28:01.6418045Z inflating: fbgemm_gpu/split_table_batched_embeddings_ops_training_common.py 2025-05-07T20:28:01.6419063Z inflating: fbgemm_gpu/ssd_split_table_batched_embeddings_ops.py 2025-05-07T20:28:01.6420570Z inflating: fbgemm_gpu/tbe_input_multiplexer.py 2025-05-07T20:28:01.6421460Z inflating: fbgemm_gpu/uvm.py 2025-05-07T20:28:01.6422574Z inflating: fbgemm_gpu/config/__init__.py 2025-05-07T20:28:01.6424196Z inflating: fbgemm_gpu/config/feature_list.py 2025-05-07T20:28:01.6425714Z inflating: fbgemm_gpu/docs/__init__.py 2025-05-07T20:28:01.6426061Z inflating: fbgemm_gpu/docs/common.py 2025-05-07T20:28:01.6428008Z inflating: fbgemm_gpu/docs/examples.py 2025-05-07T20:28:01.6429492Z inflating: fbgemm_gpu/docs/jagged_tensor_ops.py 2025-05-07T20:28:01.6430246Z inflating: fbgemm_gpu/docs/merge_pooled_embedding_ops.py 2025-05-07T20:28:01.6432365Z inflating: fbgemm_gpu/docs/permute_pooled_embedding_ops.py 2025-05-07T20:28:01.6433031Z inflating: fbgemm_gpu/docs/quantize_ops.py 2025-05-07T20:28:01.6436049Z inflating: fbgemm_gpu/docs/sparse_ops.py 2025-05-07T20:28:01.6436643Z inflating: fbgemm_gpu/docs/version.py 2025-05-07T20:28:01.6438473Z inflating: fbgemm_gpu/experimental/bench/__init__.py 2025-05-07T20:28:01.6440341Z inflating: fbgemm_gpu/experimental/bench/ck_bf16_bench.py 2025-05-07T20:28:01.6442338Z inflating: fbgemm_gpu/experimental/bench/comm_bench.py 2025-05-07T20:28:01.6444511Z inflating: fbgemm_gpu/experimental/bench/gather_scatter_bench.py 2025-05-07T20:28:01.6446845Z inflating: fbgemm_gpu/experimental/bench/quantize_bench.py 2025-05-07T20:28:01.6451482Z inflating: fbgemm_gpu/experimental/bench/quantize_ops.py 2025-05-07T20:28:01.6452936Z inflating: fbgemm_gpu/experimental/example/__init__.py 2025-05-07T20:28:01.6483670Z inflating: fbgemm_gpu/experimental/example/fbgemm_gpu_experimental_example_py.so 2025-05-07T20:28:01.6484188Z inflating: fbgemm_gpu/experimental/example/utils.py 2025-05-07T20:28:01.6486450Z inflating: fbgemm_gpu/experimental/gemm/triton_gemm/__init__.py 2025-05-07T20:28:01.6494997Z inflating: fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py 2025-05-07T20:28:01.6498271Z inflating: fbgemm_gpu/experimental/gemm/triton_gemm/grouped_gemm.py 2025-05-07T20:28:01.6499988Z inflating: fbgemm_gpu/experimental/gemm/triton_gemm/matmul_perf_model.py 2025-05-07T20:28:01.6501146Z inflating: fbgemm_gpu/experimental/gemm/triton_gemm/utils.py 2025-05-07T20:28:01.6502661Z inflating: fbgemm_gpu/experimental/gen_ai/__init__.py 2025-05-07T20:28:02.1382315Z inflating: fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai.so 2025-05-07T20:28:02.1383325Z inflating: fbgemm_gpu/experimental/gen_ai/quantize.py 2025-05-07T20:28:02.1384150Z inflating: fbgemm_gpu/experimental/gen_ai/moe/README.md 2025-05-07T20:28:02.1385324Z inflating: fbgemm_gpu/experimental/gen_ai/moe/__init__.py 2025-05-07T20:28:02.1386219Z inflating: fbgemm_gpu/experimental/gen_ai/moe/activation.py 2025-05-07T20:28:02.1387984Z inflating: fbgemm_gpu/experimental/gen_ai/moe/gather_scatter.py 2025-05-07T20:28:02.1391243Z inflating: fbgemm_gpu/experimental/gen_ai/moe/layers.py 2025-05-07T20:28:02.1393107Z inflating: fbgemm_gpu/experimental/gen_ai/moe/shuffling.py 2025-05-07T20:28:02.1393703Z inflating: fbgemm_gpu/quantize/__init__.py 2025-05-07T20:28:02.1394723Z inflating: fbgemm_gpu/quantize/quantize_ops.py 2025-05-07T20:28:02.1397065Z inflating: fbgemm_gpu/sll/__init__.py 2025-05-07T20:28:02.1399261Z inflating: fbgemm_gpu/sll/cpu/__init__.py 2025-05-07T20:28:02.1402879Z inflating: fbgemm_gpu/sll/cpu/cpu_sll.py 2025-05-07T20:28:02.1404386Z inflating: fbgemm_gpu/sll/meta/__init__.py 2025-05-07T20:28:02.1405638Z inflating: fbgemm_gpu/sll/meta/meta_sll.py 2025-05-07T20:28:02.1407281Z inflating: fbgemm_gpu/sll/triton/__init__.py 2025-05-07T20:28:02.1408190Z inflating: fbgemm_gpu/sll/triton/common.py 2025-05-07T20:28:02.1409753Z inflating: fbgemm_gpu/sll/triton/triton_dense_jagged_cat_jagged_out.py 2025-05-07T20:28:02.1410828Z inflating: fbgemm_gpu/sll/triton/triton_jagged2_to_padded_dense.py 2025-05-07T20:28:02.1412910Z inflating: fbgemm_gpu/sll/triton/triton_jagged_bmm.py 2025-05-07T20:28:02.1415120Z inflating: fbgemm_gpu/sll/triton/triton_jagged_bmm_jagged_out.py 2025-05-07T20:28:02.1416132Z inflating: fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_add.py 2025-05-07T20:28:02.1418038Z inflating: fbgemm_gpu/sll/triton/triton_jagged_dense_elementwise_mul_jagged_out.py 2025-05-07T20:28:02.1420266Z inflating: fbgemm_gpu/sll/triton/triton_jagged_dense_flash_attention.py 2025-05-07T20:28:02.1422369Z inflating: fbgemm_gpu/sll/triton/triton_jagged_flash_attention_basic.py 2025-05-07T20:28:02.1423317Z inflating: fbgemm_gpu/sll/triton/triton_jagged_self_substraction_jagged_out.py 2025-05-07T20:28:02.1425471Z inflating: fbgemm_gpu/sll/triton/triton_jagged_softmax.py 2025-05-07T20:28:02.1427841Z inflating: fbgemm_gpu/sll/triton/triton_multi_head_jagged_flash_attention.py 2025-05-07T20:28:02.1428949Z inflating: fbgemm_gpu/tbe/__init__.py 2025-05-07T20:28:02.1430979Z inflating: fbgemm_gpu/tbe/bench/__init__.py 2025-05-07T20:28:02.1432422Z inflating: fbgemm_gpu/tbe/bench/bench_config.py 2025-05-07T20:28:02.1434785Z inflating: fbgemm_gpu/tbe/bench/bench_runs.py 2025-05-07T20:28:02.1435783Z inflating: fbgemm_gpu/tbe/bench/eeg_cli.py 2025-05-07T20:28:02.1437924Z inflating: fbgemm_gpu/tbe/bench/embedding_ops_common_config.py 2025-05-07T20:28:02.1438812Z inflating: fbgemm_gpu/tbe/bench/eval_compression.py 2025-05-07T20:28:02.1439654Z inflating: fbgemm_gpu/tbe/bench/reporter.py 2025-05-07T20:28:02.1442385Z inflating: fbgemm_gpu/tbe/bench/tbe_data_config.py 2025-05-07T20:28:02.1444085Z inflating: fbgemm_gpu/tbe/bench/tbe_data_config_loader.py 2025-05-07T20:28:02.1445585Z inflating: fbgemm_gpu/tbe/bench/tbe_data_config_param_models.py 2025-05-07T20:28:02.1446539Z inflating: fbgemm_gpu/tbe/bench/utils.py 2025-05-07T20:28:02.1448162Z inflating: fbgemm_gpu/tbe/cache/__init__.py 2025-05-07T20:28:02.1449030Z inflating: fbgemm_gpu/tbe/cache/split_embeddings_cache_ops.py 2025-05-07T20:28:02.1450706Z inflating: fbgemm_gpu/tbe/ssd/__init__.py 2025-05-07T20:28:02.1451570Z inflating: fbgemm_gpu/tbe/ssd/common.py 2025-05-07T20:28:02.1454848Z inflating: fbgemm_gpu/tbe/ssd/inference.py 2025-05-07T20:28:02.1461245Z inflating: fbgemm_gpu/tbe/ssd/training.py 2025-05-07T20:28:02.1462261Z inflating: fbgemm_gpu/tbe/ssd/utils/__init__.py 2025-05-07T20:28:02.1464306Z inflating: fbgemm_gpu/tbe/ssd/utils/partially_materialized_tensor.py 2025-05-07T20:28:02.1465209Z inflating: fbgemm_gpu/tbe/stats/__init__.py 2025-05-07T20:28:02.1467127Z inflating: fbgemm_gpu/tbe/stats/bench_params_reporter.py 2025-05-07T20:28:02.1468124Z inflating: fbgemm_gpu/tbe/utils/__init__.py 2025-05-07T20:28:02.1470351Z inflating: fbgemm_gpu/tbe/utils/common.py 2025-05-07T20:28:02.1480269Z inflating: fbgemm_gpu/tbe/utils/offsets.py 2025-05-07T20:28:02.1480680Z inflating: fbgemm_gpu/tbe/utils/quantize.py 2025-05-07T20:28:02.1481152Z inflating: fbgemm_gpu/tbe/utils/requests.py 2025-05-07T20:28:02.1481512Z inflating: fbgemm_gpu/triton/__init__.py 2025-05-07T20:28:02.1481925Z inflating: fbgemm_gpu/triton/common.py 2025-05-07T20:28:02.1482338Z inflating: fbgemm_gpu/triton/quantize.py 2025-05-07T20:28:02.1484166Z inflating: fbgemm_gpu/triton/quantize_ref.py 2025-05-07T20:28:02.1485324Z inflating: fbgemm_gpu/triton/jagged/__init__.py 2025-05-07T20:28:02.1488747Z inflating: fbgemm_gpu/triton/jagged/triton_jagged_tensor_ops.py 2025-05-07T20:28:02.1489489Z inflating: fbgemm_gpu/utils/__init__.py 2025-05-07T20:28:02.1491700Z inflating: fbgemm_gpu/utils/filestore.py 2025-05-07T20:28:02.1492114Z inflating: fbgemm_gpu/utils/loader.py 2025-05-07T20:28:02.1493793Z inflating: fbgemm_gpu/utils/torch_library.py 2025-05-07T20:28:02.1495516Z inflating: fbgemm_gpu_genai-2025.5.7+cu128.dist-info/METADATA 2025-05-07T20:28:02.1496096Z inflating: fbgemm_gpu_genai-2025.5.7+cu128.dist-info/WHEEL 2025-05-07T20:28:02.1496942Z inflating: fbgemm_gpu_genai-2025.5.7+cu128.dist-info/top_level.txt 2025-05-07T20:28:02.1499481Z inflating: fbgemm_gpu_genai-2025.5.7+cu128.dist-info/RECORD 2025-05-07T20:28:02.1504691Z + grep 'Requires-Dist: torch (==.*)' -r . 2025-05-07T20:29:25.3489780Z ./17b58713-08f8-4d8e-90e5-a67221396852.sh: grep "Requires-Dist: torch (==.*)" -r . 2025-05-07T20:29:25.3510412Z + [[ ! -f pytorch/FBGEMM/ ]] 2025-05-07T20:29:25.3510675Z pytorch/FBGEMM/ not found 2025-05-07T20:29:25.3510952Z + echo 'pytorch/FBGEMM/ not found' 2025-05-07T20:29:25.3511236Z + [[ fbgemm_gpu = \t\o\r\c\h\r\e\c ]] 2025-05-07T20:29:25.3512040Z + conda run --no-capture-output -p /__w/_temp/conda_environment_14891846312 python -c 'import fbgemm_gpu; print('\''package version is '\'', fbgemm_gpu.__version__)' 2025-05-07T20:29:26.2517463Z WARNING: overwriting environment variables set in the machine 2025-05-07T20:29:26.2517866Z overwriting variable {'LD_LIBRARY_PATH'} 2025-05-07T20:29:27.4434500Z /__w/_temp/conda_environment_14891846312/lib/python3.9/site-packages/torch/cuda/__init__.py:181: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 34: CUDA driver is a stub library (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.) 2025-05-07T20:29:27.4436104Z return torch._C._cuda_getDeviceCount() > 0 2025-05-07T20:29:27.6970845Z package version is 2025.5.7+cu128 2025-05-07T20:29:28.0884624Z ##[group]Run actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 2025-05-07T20:29:28.0885425Z with: 2025-05-07T20:29:28.0885643Z name: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:29:28.0885928Z path: pytorch/FBGEMM/dist/ 2025-05-07T20:29:28.0886192Z if-no-files-found: warn 2025-05-07T20:29:28.0886428Z compression-level: 6 2025-05-07T20:29:28.0886658Z overwrite: false 2025-05-07T20:29:28.0886882Z include-hidden-files: false 2025-05-07T20:29:28.0887114Z env: 2025-05-07T20:29:28.0887302Z PYTHON_VERSION: 3.9 2025-05-07T20:29:28.0887521Z PACKAGE_TYPE: wheel 2025-05-07T20:29:28.0887754Z REPOSITORY: pytorch/FBGEMM 2025-05-07T20:29:28.0887994Z REF: 2025-05-07T20:29:28.0888180Z CU_VERSION: cu128 2025-05-07T20:29:28.0888387Z UPLOAD_TO_BASE_BUCKET: no 2025-05-07T20:29:28.0888617Z ARCH: x86_64 2025-05-07T20:29:28.0888806Z BUILD_TARGET: genai 2025-05-07T20:29:28.0889015Z CHANNEL: nightly 2025-05-07T20:29:28.0889290Z ARTIFACT_NAME: pytorch_FBGEMM__3.9_cu128_x86_64 2025-05-07T20:29:28.0889627Z BUILD_ENV_FILE: /__w/_temp/build_env_14891846312 2025-05-07T20:29:28.0889977Z CONDA_ENV: /__w/_temp/conda_environment_14891846312 2025-05-07T20:29:28.0890378Z CONDA_RUN: conda run -p /__w/_temp/conda_environment_14891846312 2025-05-07T20:29:28.0890726Z ##[endgroup] 2025-05-07T20:29:28.0893389Z ##[command]/usr/bin/docker exec d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 sh -c "cat /etc/*release | grep ^ID" 2025-05-07T20:29:28.5158709Z With the provided path, there will be 1 file uploaded 2025-05-07T20:29:28.5163804Z Artifact name is valid! 2025-05-07T20:29:28.5164675Z Root directory input is valid! 2025-05-07T20:29:28.6272977Z Beginning upload of artifact content to blob storage 2025-05-07T20:29:29.1672237Z Uploaded bytes 8388608 2025-05-07T20:29:29.4456382Z Uploaded bytes 16777216 2025-05-07T20:29:29.4981342Z Uploaded bytes 18512980 2025-05-07T20:29:29.5171432Z Finished uploading artifact content to blob storage! 2025-05-07T20:29:29.5174669Z SHA256 digest of uploaded artifact zip is 288948832c64b68ac50108a75951177e9bd5a8870ff753ce127942be614b4f46 2025-05-07T20:29:29.5176839Z Finalizing artifact upload 2025-05-07T20:29:29.5921850Z Artifact pytorch_FBGEMM__3.9_cu128_x86_64.zip successfully finalized. Artifact ID 3081574154 2025-05-07T20:29:29.5922634Z Artifact pytorch_FBGEMM__3.9_cu128_x86_64 has been successfully uploaded! Final size is 18512980 bytes. Artifact ID is 3081574154 2025-05-07T20:29:29.5931023Z Artifact download URL: https://github.com/pytorch/FBGEMM/actions/runs/14891846312/artifacts/3081574154 2025-05-07T20:29:29.6165949Z Post job cleanup. 2025-05-07T20:29:29.6229823Z Post job cleanup. 2025-05-07T20:29:29.6318566Z Post job cleanup. 2025-05-07T20:29:29.6364681Z Post job cleanup. 2025-05-07T20:29:29.6368771Z ##[command]/usr/bin/docker exec d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 sh -c "cat /etc/*release | grep ^ID" 2025-05-07T20:29:29.9348660Z [command]/opt/conda/bin/git version 2025-05-07T20:29:29.9413092Z git version 2.40.1 2025-05-07T20:29:29.9451820Z Copying '/github/home/.gitconfig' to '/__w/_temp/aed662aa-169a-46e1-b26f-531ef4d616da/.gitconfig' 2025-05-07T20:29:29.9461415Z Temporarily overriding HOME='/__w/_temp/aed662aa-169a-46e1-b26f-531ef4d616da' before making global git config changes 2025-05-07T20:29:29.9462170Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:29:29.9477451Z [command]/opt/conda/bin/git config --global --add safe.directory /__w/FBGEMM/FBGEMM/pytorch/FBGEMM 2025-05-07T20:29:29.9528732Z [command]/opt/conda/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:29:29.9568850Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:29:30.0060224Z Entering 'external/asmjit' 2025-05-07T20:29:30.0156206Z Entering 'external/composable_kernel' 2025-05-07T20:29:30.0258925Z Entering 'external/cpuinfo' 2025-05-07T20:29:30.0354444Z Entering 'external/cutlass' 2025-05-07T20:29:30.0457479Z Entering 'external/googletest' 2025-05-07T20:29:30.0554392Z Entering 'external/hipify_torch' 2025-05-07T20:29:30.0653257Z Entering 'external/json' 2025-05-07T20:29:30.0767385Z [command]/opt/conda/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:29:30.0797527Z http.https://github.com/.extraheader 2025-05-07T20:29:30.0809472Z [command]/opt/conda/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:29:30.0848158Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:29:30.1315460Z Entering 'external/asmjit' 2025-05-07T20:29:30.1379372Z http.https://github.com/.extraheader 2025-05-07T20:29:30.1436565Z Entering 'external/composable_kernel' 2025-05-07T20:29:30.1510260Z http.https://github.com/.extraheader 2025-05-07T20:29:30.1573270Z Entering 'external/cpuinfo' 2025-05-07T20:29:30.1639456Z http.https://github.com/.extraheader 2025-05-07T20:29:30.1698684Z Entering 'external/cutlass' 2025-05-07T20:29:30.1764495Z http.https://github.com/.extraheader 2025-05-07T20:29:30.1829728Z Entering 'external/googletest' 2025-05-07T20:29:30.1895855Z http.https://github.com/.extraheader 2025-05-07T20:29:30.1951972Z Entering 'external/hipify_torch' 2025-05-07T20:29:30.2019226Z http.https://github.com/.extraheader 2025-05-07T20:29:30.2077592Z Entering 'external/json' 2025-05-07T20:29:30.2144867Z http.https://github.com/.extraheader 2025-05-07T20:29:30.2408853Z Post job cleanup. 2025-05-07T20:29:30.2413237Z ##[command]/usr/bin/docker exec d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 sh -c "cat /etc/*release | grep ^ID" 2025-05-07T20:29:30.5620812Z [command]/opt/conda/bin/git version 2025-05-07T20:29:30.5664450Z git version 2.40.1 2025-05-07T20:29:30.5704652Z Copying '/github/home/.gitconfig' to '/__w/_temp/041888f0-3f7c-46cc-81de-a958b8d64f0e/.gitconfig' 2025-05-07T20:29:30.5714812Z Temporarily overriding HOME='/__w/_temp/041888f0-3f7c-46cc-81de-a958b8d64f0e' before making global git config changes 2025-05-07T20:29:30.5715936Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:29:30.5730494Z [command]/opt/conda/bin/git config --global --add safe.directory /__w/FBGEMM/FBGEMM/test-infra 2025-05-07T20:29:30.5779689Z [command]/opt/conda/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:29:30.5820232Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:29:30.6302583Z [command]/opt/conda/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:29:30.6333002Z http.https://github.com/.extraheader 2025-05-07T20:29:30.6345391Z [command]/opt/conda/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:29:30.6386284Z [command]/opt/conda/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:29:30.7005862Z Stop and remove container: c094c8ba304b42fabfe3662b3a2b7dff_pytorchmanylinux2_28buildercuda128_5a87e8 2025-05-07T20:29:30.7010433Z ##[command]/usr/bin/docker rm --force d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 2025-05-07T20:29:32.6087751Z d075a29d5e53b1d3b22e3b2980b7635eb69e21447a2746fe496e4e635d7712c5 2025-05-07T20:29:32.6127043Z Remove container network: github_network_647c08a59ff240119465636525774298 2025-05-07T20:29:32.6131026Z ##[command]/usr/bin/docker network rm github_network_647c08a59ff240119465636525774298 2025-05-07T20:29:33.6361232Z github_network_647c08a59ff240119465636525774298 2025-05-07T20:29:33.6395374Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:29:33.6417033Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:29:33.6424876Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:33.6425222Z ##[endgroup] 2025-05-07T20:29:43.5582273Z Cleaning up orphan processes