2025-07-02T08:02:30.7771525Z Current runner version: '2.325.0'
2025-07-02T08:02:30.7777090Z Runner name: 'i-07e2f8e738f67d521'
2025-07-02T08:02:30.7777810Z Runner group name: 'default'
2025-07-02T08:02:30.7778686Z Machine name: 'ip-10-0-48-158'
2025-07-02T08:02:30.7781402Z ##[group]GITHUB_TOKEN Permissions
2025-07-02T08:02:30.7783486Z Contents: read
2025-07-02T08:02:30.7783996Z Metadata: read
2025-07-02T08:02:30.7784490Z ##[endgroup]
2025-07-02T08:02:30.7786538Z Secret source: Actions
2025-07-02T08:02:30.7787271Z Prepare workflow directory
2025-07-02T08:02:30.8336717Z Prepare all required actions
2025-07-02T08:02:30.8373041Z Getting action download info
2025-07-02T08:02:31.1724031Z Download action repository 'actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-07-02T08:02:31.4459623Z Download action repository 'pytorch/pytorch@main' (SHA:0364db7cd14ffa67b48ef8c27fefbb3eed2b065d)
2025-07-02T08:02:46.5344273Z Download action repository 'actions/download-artifact@d3f86a106a0bac45b974a628896c90dbdf5c8093' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-07-02T08:02:46.8907344Z Download action repository 'pmeier/pytest-results-action@a2c1430e2bddadbad9f49a6f9b879f062c6b19b1' (SHA:a2c1430e2bddadbad9f49a6f9b879f062c6b19b1)
2025-07-02T08:02:47.0158587Z Download action repository 'actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02)
2025-07-02T08:02:47.4616822Z Download action repository 'seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a' (SHA:baba72d0712b404f646cebe0730933554ebce96a)
2025-07-02T08:02:47.7382925Z Getting action download info
2025-07-02T08:02:47.9038407Z Uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@refs/heads/main (4e43bd8700fb3fac32b6155020e13e6033eb4bcb)
2025-07-02T08:02:47.9042267Z ##[group] Inputs
2025-07-02T08:02:47.9044773Z script:
    if [[ "refs/pull/3030/merge" =~ release/* ]]; then
      export RELEASE=1
      export TORCH_VERSION=stable
    else
      export RELEASE=0
      export TORCH_VERSION=nightly
    fi
    # Set env vars from matrix
    export PYTHON_VERSION=3.9
    # Commenting these out for now because the GPU tests are not working inside docker
    export CUDA_ARCH_VERSION=12.8
    export CU_VERSION="cu${CUDA_ARCH_VERSION:0:2}${CUDA_ARCH_VERSION:3:1}"
    # Remove the following line when the GPU tests are working inside docker, and uncomment the above lines
    #export CU_VERSION="cpu"
    export TD_GET_DEFAULTS_TO_NONE=1
    bash .github/unittest/linux_libs/scripts_habitat/run_all.sh
2025-07-02T08:02:47.9047720Z timeout: 90
2025-07-02T08:02:47.9047973Z runner: linux.g5.4xlarge.nvidia.gpu
2025-07-02T08:02:47.9048281Z upload-artifact:
2025-07-02T08:02:47.9048785Z upload-artifact-to-s3: false
2025-07-02T08:02:47.9049090Z download-artifact:
2025-07-02T08:02:47.9049341Z repository: pytorch/rl
2025-07-02T08:02:47.9049604Z fetch-depth: 1
2025-07-02T08:02:47.9049823Z submodules:
2025-07-02T08:02:47.9050032Z ref:
2025-07-02T08:02:47.9050273Z test-infra-repository: pytorch/test-infra
2025-07-02T08:02:47.9050612Z test-infra-ref:
2025-07-02T08:02:47.9050873Z use-custom-docker-registry: true
2025-07-02T08:02:47.9051223Z docker-image: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:47.9051616Z docker-build-dir: .ci/docker
2025-07-02T08:02:47.9051919Z gpu-arch-type: cuda
2025-07-02T08:02:47.9052169Z gpu-arch-version: 12.8
2025-07-02T08:02:47.9052416Z job-name: linux-job
2025-07-02T08:02:47.9052664Z continue-on-error: false
2025-07-02T08:02:47.9052919Z binary-matrix:
2025-07-02T08:02:47.9053160Z run-with-docker: true
2025-07-02T08:02:47.9053424Z secrets-env:
2025-07-02T08:02:47.9053638Z no-sudo: false
2025-07-02T08:02:47.9053866Z ##[endgroup]
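A note on the `CU_VERSION` line in the script input above: it derives the CUDA wheel-index suffix from `CUDA_ARCH_VERSION` with bash substring expansion, `${var:offset:length}`. A standalone sketch of that derivation (illustrative, not part of the log):

    # Illustrative only: how CU_VERSION is built from "12.8".
    CUDA_ARCH_VERSION=12.8
    # ${CUDA_ARCH_VERSION:0:2} -> "12" (two chars from offset 0)
    # ${CUDA_ARCH_VERSION:3:1} -> "8"  (one char from offset 3, skipping the dot)
    CU_VERSION="cu${CUDA_ARCH_VERSION:0:2}${CUDA_ARCH_VERSION:3:1}"
    echo "$CU_VERSION"   # prints: cu128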
2025-07-02T08:02:47.9054115Z Complete job name: tests (3.9, 12.8) / linux-job
2025-07-02T08:02:47.9697270Z A job started hook has been configured by the self-hosted runner administrator
2025-07-02T08:02:47.9834440Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-07-02T08:02:47.9846281Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:02:47.9847175Z ##[endgroup]
2025-07-02T08:02:49.3350605Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-07-02T08:02:49.3351147Z Instance Type: g5.4xlarge
2025-07-02T08:02:49.3351403Z AMI Name: unknown
2025-07-02T08:02:49.3395885Z AMI ID: ami-05ffe3c48a9991133
2025-07-02T08:02:54.8873499Z ##[group]Run set -euxo pipefail
2025-07-02T08:02:54.8873900Z set -euxo pipefail
2025-07-02T08:02:54.8874205Z if [[ "${NO_SUDO}" == "false" ]]; then
2025-07-02T08:02:54.8874587Z   echo "::group::Cleanup with-sudo debug output"
2025-07-02T08:02:54.8874979Z   sudo rm -rfv "${GITHUB_WORKSPACE}"
2025-07-02T08:02:54.8875289Z else
2025-07-02T08:02:54.8875558Z   echo "::group::Cleanup no-sudo debug output"
2025-07-02T08:02:54.8875931Z   rm -rfv "${GITHUB_WORKSPACE}"
2025-07-02T08:02:54.8876230Z fi
2025-07-02T08:02:54.8876442Z 
2025-07-02T08:02:54.8876670Z mkdir -p "${GITHUB_WORKSPACE}"
2025-07-02T08:02:54.8877007Z echo "::endgroup::"
2025-07-02T08:02:54.8891964Z shell: /usr/bin/bash -e {0}
2025-07-02T08:02:54.8892231Z env:
2025-07-02T08:02:54.8892479Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:54.8892841Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:54.8893135Z   PR_NUMBER: 3030
2025-07-02T08:02:54.8895724Z   SCRIPT: if [[ "refs/pull/3030/merge" =~ release/* ]]; then export RELEASE=1 export TORCH_VERSION=stable else export RELEASE=0 export TORCH_VERSION=nightly fi # Set env vars from matrix export PYTHON_VERSION=3.9 # Commenting these out for now because the GPU tests are not working inside docker export CUDA_ARCH_VERSION=12.8 export CU_VERSION="cu${CUDA_ARCH_VERSION:0:2}${CUDA_ARCH_VERSION:3:1}" # Remove the following line when the GPU tests are working inside docker, and uncomment the above lines #export CU_VERSION="cpu" export TD_GET_DEFAULTS_TO_NONE=1 bash .github/unittest/linux_libs/scripts_habitat/run_all.sh
2025-07-02T08:02:54.8898249Z   NO_SUDO: false
2025-07-02T08:02:54.8898464Z ##[endgroup]
2025-07-02T08:02:54.8936477Z + [[ false == \f\a\l\s\e ]]
2025-07-02T08:02:54.8949516Z ##[group]Cleanup with-sudo debug output
2025-07-02T08:02:54.8952355Z + echo '::group::Cleanup with-sudo debug output'
2025-07-02T08:02:54.8952807Z + sudo rm -rfv /home/ec2-user/actions-runner/_work/rl/rl
2025-07-02T08:02:54.9357192Z removed directory '/home/ec2-user/actions-runner/_work/rl/rl'
2025-07-02T08:02:54.9380980Z + mkdir -p /home/ec2-user/actions-runner/_work/rl/rl
2025-07-02T08:02:54.9398522Z + echo ::endgroup::
2025-07-02T08:02:54.9399312Z ##[endgroup]
2025-07-02T08:02:54.9517518Z ##[group]Run actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683
2025-07-02T08:02:54.9518067Z with:
2025-07-02T08:02:54.9518358Z   repository: pytorch/test-infra
2025-07-02T08:02:54.9518730Z   path: test-infra
2025-07-02T08:02:54.9519021Z   submodules: recursive
2025-07-02T08:02:54.9519618Z   token: ***
2025-07-02T08:02:54.9519930Z   ssh-strict: true
2025-07-02T08:02:54.9520206Z   ssh-user: git
2025-07-02T08:02:54.9520526Z   persist-credentials: true
2025-07-02T08:02:54.9520851Z   clean: true
2025-07-02T08:02:54.9521155Z   sparse-checkout-cone-mode: true
2025-07-02T08:02:54.9521513Z   fetch-depth: 1
2025-07-02T08:02:54.9521796Z   fetch-tags: false
2025-07-02T08:02:54.9522096Z   show-progress: true
2025-07-02T08:02:54.9522387Z   lfs: false
2025-07-02T08:02:54.9522665Z   set-safe-directory: true
2025-07-02T08:02:54.9522975Z env:
2025-07-02T08:02:54.9523290Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:54.9523726Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:54.9524083Z   PR_NUMBER: 3030
2025-07-02T08:02:54.9526989Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:54.9530213Z ##[endgroup]
2025-07-02T08:02:55.0989269Z Syncing repository: pytorch/test-infra
2025-07-02T08:02:55.0990298Z ##[group]Getting Git version info
2025-07-02T08:02:55.0991023Z Working directory is '/home/ec2-user/actions-runner/_work/rl/rl/test-infra'
2025-07-02T08:02:55.0992079Z [command]/usr/bin/git version
2025-07-02T08:02:55.1000420Z git version 2.47.1
2025-07-02T08:02:55.1027111Z ##[endgroup]
2025-07-02T08:02:55.1050865Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/4f34d488-136d-477c-90d0-679d2ffde98d' before making global git config changes
2025-07-02T08:02:55.1052382Z Adding repository directory to the temporary git global config as a safe directory
2025-07-02T08:02:55.1057370Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/rl/rl/test-infra
2025-07-02T08:02:55.1115980Z ##[group]Initializing the repository
2025-07-02T08:02:55.1116554Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/rl/rl/test-infra
2025-07-02T08:02:55.1152248Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-07-02T08:02:55.1152907Z hint: is subject to change. To configure the initial branch name to use in all
2025-07-02T08:02:55.1153563Z hint: of your new repositories, which will suppress this warning, call:
2025-07-02T08:02:55.1154018Z hint:
2025-07-02T08:02:55.1154340Z hint:   git config --global init.defaultBranch <name>
2025-07-02T08:02:55.1154701Z hint:
2025-07-02T08:02:55.1155056Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-07-02T08:02:55.1155655Z hint: 'development'. The just-created branch can be renamed via this command:
2025-07-02T08:02:55.1156107Z hint:
2025-07-02T08:02:55.1156338Z hint:   git branch -m <name>
2025-07-02T08:02:55.1156906Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/rl/rl/test-infra/.git/
2025-07-02T08:02:55.1164678Z [command]/usr/bin/git remote add origin https://github.com/pytorch/test-infra
2025-07-02T08:02:55.1475897Z ##[endgroup]
2025-07-02T08:02:55.1476337Z ##[group]Disabling automatic garbage collection
2025-07-02T08:02:55.1481592Z [command]/usr/bin/git config --local gc.auto 0
2025-07-02T08:02:55.1528405Z ##[endgroup]
2025-07-02T08:02:55.1528810Z ##[group]Setting up auth
2025-07-02T08:02:55.1535156Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-07-02T08:02:55.1570183Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-07-02T08:02:55.1993361Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-07-02T08:02:55.2028774Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-07-02T08:02:55.2421363Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-07-02T08:02:55.2473725Z ##[endgroup]
2025-07-02T08:02:55.2474182Z ##[group]Determining the default branch
2025-07-02T08:02:55.2476716Z Retrieving the default branch name
2025-07-02T08:02:55.5156232Z Default branch 'main'
2025-07-02T08:02:55.5157010Z ##[endgroup]
2025-07-02T08:02:55.5157431Z ##[group]Fetching the repository
2025-07-02T08:02:55.5163302Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +refs/heads/main:refs/remotes/origin/main
2025-07-02T08:02:56.0256026Z From https://github.com/pytorch/test-infra
2025-07-02T08:02:56.0256484Z  * [new branch]      main       -> origin/main
2025-07-02T08:02:56.0288035Z ##[endgroup]
2025-07-02T08:02:56.0288478Z ##[group]Determining the checkout info
2025-07-02T08:02:56.0289402Z ##[endgroup]
2025-07-02T08:02:56.0295128Z [command]/usr/bin/git sparse-checkout disable
2025-07-02T08:02:56.0346844Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-07-02T08:02:56.0380342Z ##[group]Checking out the ref
2025-07-02T08:02:56.0384012Z [command]/usr/bin/git checkout --progress --force -B main refs/remotes/origin/main
2025-07-02T08:02:56.1885380Z Switched to a new branch 'main'
2025-07-02T08:02:56.1888015Z branch 'main' set up to track 'origin/main'.
2025-07-02T08:02:56.1901804Z ##[endgroup]
2025-07-02T08:02:56.1902282Z ##[group]Setting up auth for fetching submodules
2025-07-02T08:02:56.1906995Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-07-02T08:02:56.1959566Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-07-02T08:02:56.1996007Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-07-02T08:02:56.2032518Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-07-02T08:02:56.2063762Z ##[endgroup]
2025-07-02T08:02:56.2064149Z ##[group]Fetching submodules
2025-07-02T08:02:56.2067848Z [command]/usr/bin/git submodule sync --recursive
2025-07-02T08:02:56.2451788Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1 --recursive
2025-07-02T08:02:56.2845077Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0
2025-07-02T08:02:56.3230886Z ##[endgroup]
2025-07-02T08:02:56.3231340Z ##[group]Persisting credentials for submodules
2025-07-02T08:02:56.3235680Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-07-02T08:02:56.3617708Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-07-02T08:02:56.4001373Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-07-02T08:02:56.4387730Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-07-02T08:02:56.4769123Z ##[endgroup]
2025-07-02T08:02:56.4813908Z [command]/usr/bin/git log -1 --format=%H
2025-07-02T08:02:56.4845119Z 4e43bd8700fb3fac32b6155020e13e6033eb4bcb
2025-07-02T08:02:56.5084776Z Prepare all required actions
2025-07-02T08:02:56.5085233Z Getting action download info
2025-07-02T08:02:56.6127078Z Download action repository 'pytorch/test-infra@main' (SHA:4e43bd8700fb3fac32b6155020e13e6033eb4bcb)
2025-07-02T08:02:58.5505154Z Getting action download info
2025-07-02T08:02:58.6498253Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
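The `insteadOf` entries configured above are how the checkout action keeps HTTPS token auth working for submodules that declare SSH-style URLs: git rewrites matching URL prefixes before contacting the remote, so the injected AUTHORIZATION header applies. A minimal sketch of the mechanism (illustrative, not part of the log):

    # Illustrative: after this rewrite, any remote URL starting with
    # "git@github.com:" is fetched over HTTPS instead of SSH.
    git config --global --add url.https://github.com/.insteadOf git@github.com:
    # This SSH-style URL now transparently contacts https://github.com/pytorch/test-infra
    git ls-remote git@github.com:pytorch/test-infra HEAD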
2025-07-02T08:02:58.8106850Z ##[group]Run ./test-infra/.github/actions/setup-linux
2025-07-02T08:02:58.8107199Z env:
2025-07-02T08:02:58.8107457Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:58.8107811Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:58.8108065Z   PR_NUMBER: 3030
2025-07-02T08:02:58.8110488Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:58.8113442Z ##[endgroup]
2025-07-02T08:02:58.8197315Z ##[group]Run set -euo pipefail
2025-07-02T08:02:58.8197642Z set -euo pipefail
2025-07-02T08:02:58.8197922Z function get_ec2_metadata() {
2025-07-02T08:02:58.8198283Z   # Pulled from instance metadata endpoint for EC2
2025-07-02T08:02:58.8198933Z   # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
2025-07-02T08:02:58.8199502Z   category=$1
2025-07-02T08:02:58.8200414Z   curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
2025-07-02T08:02:58.8201350Z }
2025-07-02T08:02:58.8201605Z echo "ami-id: $(get_ec2_metadata ami-id)"
2025-07-02T08:02:58.8202040Z echo "instance-id: $(get_ec2_metadata instance-id)"
2025-07-02T08:02:58.8202515Z echo "instance-type: $(get_ec2_metadata instance-type)"
2025-07-02T08:02:58.8202933Z echo "system info $(uname -a)"
2025-07-02T08:02:58.8212683Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:02:58.8213056Z env:
2025-07-02T08:02:58.8213305Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:58.8213661Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:58.8213910Z   PR_NUMBER: 3030
2025-07-02T08:02:58.8216416Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:58.8218859Z ##[endgroup]
2025-07-02T08:02:58.8396316Z ami-id: ami-05ffe3c48a9991133
2025-07-02T08:02:58.8523993Z instance-id: i-07e2f8e738f67d521
2025-07-02T08:02:58.8670387Z instance-type: g5.4xlarge
2025-07-02T08:02:58.8686240Z system info Linux ip-10-0-48-158.ec2.internal 6.1.141-155.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Jun 17 10:29:47 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
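The `get_ec2_metadata` helper in the step above follows AWS's IMDSv2 flow: it first obtains a short-lived session token with a PUT request, then passes that token on the actual metadata GET. The same flow split into two explicit steps (illustrative, not part of the log):

    # Illustrative: IMDSv2 requires a session token before querying metadata.
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 30")
    # Pass the token on the metadata request proper.
    curl -fsSL -H "X-aws-ec2-metadata-token: ${TOKEN}" \
        "http://169.254.169.254/latest/meta-data/instance-type"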
2025-07-02T08:02:58.8728434Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-07-02T08:02:58.8729393Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-07-02T08:02:58.8756095Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:02:58.8756514Z env:
2025-07-02T08:02:58.8756778Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:58.8757124Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:58.8757371Z   PR_NUMBER: 3030
2025-07-02T08:02:58.8759766Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:58.8762207Z ##[endgroup]
2025-07-02T08:02:58.8855773Z ##[group]Run if systemctl is-active --quiet docker; then
2025-07-02T08:02:58.8856218Z if systemctl is-active --quiet docker; then
2025-07-02T08:02:58.8856792Z   echo "Docker daemon is running...";
2025-07-02T08:02:58.8857118Z else
2025-07-02T08:02:58.8857461Z   echo "Starting docker daemon..." && sudo systemctl start docker;
2025-07-02T08:02:58.8857881Z fi
2025-07-02T08:02:58.8867058Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:02:58.8867433Z env:
2025-07-02T08:02:58.8867679Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:58.8868033Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:58.8868275Z   PR_NUMBER: 3030
2025-07-02T08:02:58.8870683Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:58.8873134Z ##[endgroup]
2025-07-02T08:02:58.8975455Z Docker daemon is running...
2025-07-02T08:02:58.9010188Z ##[group]Run AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
2025-07-02T08:02:58.9011041Z AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\")
2025-07-02T08:02:58.9011569Z retry () { "$@" || (sleep 1 && "$@") || (sleep 2 && "$@") }
2025-07-02T08:02:58.9012212Z retry aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \
2025-07-02T08:02:58.9012964Z   --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com"
2025-07-02T08:02:58.9021937Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:02:58.9022324Z env:
2025-07-02T08:02:58.9022583Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:58.9022947Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:58.9023202Z   PR_NUMBER: 3030
2025-07-02T08:02:58.9025787Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:58.9028257Z   AWS_RETRY_MODE: standard
2025-07-02T08:02:58.9028548Z   AWS_MAX_ATTEMPTS: 5
2025-07-02T08:02:58.9028800Z   AWS_DEFAULT_REGION: us-east-1
2025-07-02T08:02:58.9029075Z ##[endgroup]
2025-07-02T08:02:59.9727078Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json.
2025-07-02T08:02:59.9727796Z Configure a credential helper to remove this warning.
2025-07-02T08:02:59.9728431Z See https://docs.docker.com/engine/reference/commandline/login/#credentials-store
2025-07-02T08:02:59.9728938Z 
2025-07-02T08:02:59.9729070Z Login Succeeded
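The ECR login above is wrapped in a small `retry` helper that reruns a failed command with increasing back-off before giving up. Written out multi-line for readability (illustrative, not part of the log):

    # Illustrative: run "$@"; on failure retry after 1s, then once more after 2s.
    retry () {
        "$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
    }
    retry aws sts get-caller-identity >/dev/null   # example invocation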
2025-07-02T08:02:59.9976028Z ##[group]Run env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}"
2025-07-02T08:02:59.9976617Z env | grep '^GITHUB' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}"
2025-07-02T08:02:59.9977132Z env | grep '^CI' >> "${RUNNER_TEMP}/github_env_${GITHUB_RUN_ID}"
2025-07-02T08:02:59.9986385Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:02:59.9986746Z env:
2025-07-02T08:02:59.9987002Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:02:59.9987357Z   REPOSITORY: pytorch/rl
2025-07-02T08:02:59.9987816Z   PR_NUMBER: 3030
2025-07-02T08:02:59.9990220Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:02:59.9992674Z ##[endgroup]
2025-07-02T08:03:00.0099315Z ##[group]Run RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts"
2025-07-02T08:03:00.0099809Z RUNNER_ARTIFACT_DIR="${RUNNER_TEMP}/artifacts"
2025-07-02T08:03:00.0100209Z sudo rm -rf "${RUNNER_ARTIFACT_DIR}"
2025-07-02T08:03:00.0100577Z mkdir -p "${RUNNER_ARTIFACT_DIR}"
2025-07-02T08:03:00.0101034Z echo "RUNNER_ARTIFACT_DIR=${RUNNER_ARTIFACT_DIR}" >> "${GITHUB_ENV}"
2025-07-02T08:03:00.0101461Z 
2025-07-02T08:03:00.0101765Z RUNNER_TEST_RESULTS_DIR="${RUNNER_TEMP}/test-results"
2025-07-02T08:03:00.0102193Z sudo rm -rf "${RUNNER_TEST_RESULTS_DIR}"
2025-07-02T08:03:00.0102564Z mkdir -p "${RUNNER_TEST_RESULTS_DIR}"
2025-07-02T08:03:00.0103061Z echo "RUNNER_TEST_RESULTS_DIR=${RUNNER_TEST_RESULTS_DIR}" >> "${GITHUB_ENV}"
2025-07-02T08:03:00.0103512Z 
2025-07-02T08:03:00.0103749Z RUNNER_DOCS_DIR="${RUNNER_TEMP}/docs"
2025-07-02T08:03:00.0104099Z sudo rm -rf "${RUNNER_DOCS_DIR}"
2025-07-02T08:03:00.0104429Z mkdir -p "${RUNNER_DOCS_DIR}"
2025-07-02T08:03:00.0104830Z echo "RUNNER_DOCS_DIR=${RUNNER_DOCS_DIR}" >> "${GITHUB_ENV}"
2025-07-02T08:03:00.0114568Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:03:00.0114940Z env:
2025-07-02T08:03:00.0115191Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:03:00.0115545Z   REPOSITORY: pytorch/rl
2025-07-02T08:03:00.0115787Z   PR_NUMBER: 3030
2025-07-02T08:03:00.0118344Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:03:00.0120799Z ##[endgroup]
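The step above appends `KEY=VALUE` lines to `${GITHUB_ENV}`; the runner re-reads that file, so the values become environment variables in every later step of the job (the env dumps below confirm this for `RUNNER_ARTIFACT_DIR` and friends). A minimal sketch (illustrative; `MY_EXAMPLE_DIR` is a hypothetical name, not from this workflow):

    # Illustrative: later steps in this job will see MY_EXAMPLE_DIR in their env.
    MY_EXAMPLE_DIR="${RUNNER_TEMP}/example"
    mkdir -p "${MY_EXAMPLE_DIR}"
    echo "MY_EXAMPLE_DIR=${MY_EXAMPLE_DIR}" >> "${GITHUB_ENV}"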
2025-07-02T08:03:00.4957182Z ##[group]Run needs=0
2025-07-02T08:03:00.4957509Z needs=0
2025-07-02T08:03:00.4958005Z if lspci -v | grep -e 'controller.*NVIDIA' >/dev/null 2>/dev/null; then
2025-07-02T08:03:00.4958482Z   needs=1
2025-07-02T08:03:00.4958857Z fi
2025-07-02T08:03:00.4959157Z echo "does=${needs}" >> $GITHUB_OUTPUT
2025-07-02T08:03:00.4984818Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:03:00.4985228Z env:
2025-07-02T08:03:00.4985477Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:03:00.4985822Z   REPOSITORY: pytorch/rl
2025-07-02T08:03:00.4986068Z   PR_NUMBER: 3030
2025-07-02T08:03:00.4988478Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:03:00.4991436Z   RUNNER_ARTIFACT_DIR: /home/ec2-user/actions-runner/_work/_temp/artifacts
2025-07-02T08:03:00.4992017Z   RUNNER_TEST_RESULTS_DIR: /home/ec2-user/actions-runner/_work/_temp/test-results
2025-07-02T08:03:00.4992561Z   RUNNER_DOCS_DIR: /home/ec2-user/actions-runner/_work/_temp/docs
2025-07-02T08:03:00.4992933Z ##[endgroup]
2025-07-02T08:03:00.5342502Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-07-02T08:03:00.5342910Z with:
2025-07-02T08:03:00.5343144Z   driver-version: 570.133.07
2025-07-02T08:03:00.5343396Z env:
2025-07-02T08:03:00.5343649Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:03:00.5344001Z   REPOSITORY: pytorch/rl
2025-07-02T08:03:00.5344256Z   PR_NUMBER: 3030
2025-07-02T08:03:00.5346712Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:03:00.5349325Z   RUNNER_ARTIFACT_DIR: /home/ec2-user/actions-runner/_work/_temp/artifacts
2025-07-02T08:03:00.5349927Z   RUNNER_TEST_RESULTS_DIR: /home/ec2-user/actions-runner/_work/_temp/test-results
2025-07-02T08:03:00.5350493Z   RUNNER_DOCS_DIR: /home/ec2-user/actions-runner/_work/_temp/docs
2025-07-02T08:03:00.5350877Z ##[endgroup]
2025-07-02T08:03:00.5396003Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-07-02T08:03:00.5396427Z with:
2025-07-02T08:03:00.5396644Z   timeout_minutes: 10
2025-07-02T08:03:00.5396883Z   max_attempts: 3
2025-07-02T08:03:00.5426641Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y \
          nvidia-docker2 \
          nvidia-container-toolkit-1.16.2 \
          libnvidia-container-tools-1.16.2 \
          libnvidia-container1-1.16.2 \
          nvidia-container-toolkit-base-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install nvidia-driver package if not installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists, check its version next. Also check only the first GPU if there are more than one of them
          # so that the same driver version is not printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install NVIDIA driver, try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again if nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the GPU microcode. When this
            # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in
          # the case where the driver has already crashed as it still can get the driver version
          # and some basic information like the bus ID. However, the rest of the information
          # would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead as it will guarantee to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like the
          # GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?
          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true
    # This should show persistence mode ON
    nvidia-smi
    # check if the container-toolkit is correctly installed and CUDA is available inside a container
    docker run --rm -t --gpus=all public.ecr.aws/docker/library/python:3.13 nvidia-smi
2025-07-02T08:03:00.5455772Z   retry_wait_seconds: 10
2025-07-02T08:03:00.5456037Z   polling_interval_seconds: 1
2025-07-02T08:03:00.5456313Z   warning_on_retry: true
2025-07-02T08:03:00.5456563Z   continue_on_error: false
2025-07-02T08:03:00.5456808Z env:
2025-07-02T08:03:00.5457055Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:03:00.5457409Z   REPOSITORY: pytorch/rl
2025-07-02T08:03:00.5457658Z   PR_NUMBER: 3030
2025-07-02T08:03:00.5460040Z   SCRIPT: (unchanged; same script as above)
2025-07-02T08:03:00.5462634Z   RUNNER_ARTIFACT_DIR: /home/ec2-user/actions-runner/_work/_temp/artifacts
2025-07-02T08:03:00.5463232Z   RUNNER_TEST_RESULTS_DIR: /home/ec2-user/actions-runner/_work/_temp/test-results
2025-07-02T08:03:00.5463784Z   RUNNER_DOCS_DIR: /home/ec2-user/actions-runner/_work/_temp/docs
2025-07-02T08:03:00.5464186Z   DRIVER_VERSION: 570.133.07
2025-07-02T08:03:00.5464447Z ##[endgroup]
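One detail worth flagging in the command above: nvidia-smi's exit status is checked against an allow-list of 0 and 14 (per the linked gpu-operator issue), not just 0. The same gate written as a case statement (illustrative, not part of the log; assumes errexit is disabled around the probe):

    # Illustrative: gate on nvidia-smi's exit status, allowing only 0 and 14.
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    status=$?
    case "${status}" in
      0|14) echo "INFO: Ignoring allowed status ${status}" ;;
      *)    echo "ERROR: nvidia-smi exited with unresolved status ${status}" >&2
            exit "${status}" ;;
    esac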
2025-07-02T08:03:00.6296972Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-07-02T08:03:00.6298009Z + pre_install_nvidia_driver_amzn2
2025-07-02T08:03:00.6298354Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-07-02T08:03:00.9564643Z No match for argument: nvidia-driver-latest-dkms
2025-07-02T08:03:00.9565029Z No packages marked for removal.
2025-07-02T08:03:00.9642984Z Dependencies resolved.
2025-07-02T08:03:00.9652709Z Nothing to do.
2025-07-02T08:03:00.9654362Z Complete!
2025-07-02T08:03:00.9994805Z + install_nvidia_driver_common
2025-07-02T08:03:01.0000163Z + echo 'Before installing NVIDIA driver'
2025-07-02T08:03:01.0000477Z + lspci
2025-07-02T08:03:01.0000739Z Before installing NVIDIA driver
2025-07-02T08:03:01.0130951Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-07-02T08:03:01.0132008Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-07-02T08:03:01.0133201Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-07-02T08:03:01.0134925Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-07-02T08:03:01.0135789Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-07-02T08:03:01.0136404Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-07-02T08:03:01.0136933Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-07-02T08:03:01.0137465Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-07-02T08:03:01.0137903Z + lsmod
2025-07-02T08:03:01.0181885Z Module                  Size  Used by
2025-07-02T08:03:01.0182757Z xt_conntrack           16384  1
2025-07-02T08:03:01.0183456Z nft_chain_nat          16384  3
2025-07-02T08:03:01.0184172Z xt_MASQUERADE          20480  1
2025-07-02T08:03:01.0184895Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-07-02T08:03:01.0185392Z nf_conntrack_netlink   57344  0
2025-07-02T08:03:01.0185826Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-07-02T08:03:01.0186303Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-07-02T08:03:01.0186632Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-07-02T08:03:01.0186938Z xfrm_user              57344  1
2025-07-02T08:03:01.0187216Z xfrm_algo              16384  1 xfrm_user
2025-07-02T08:03:01.0187708Z xt_addrtype            16384  2
2025-07-02T08:03:01.0187979Z nft_compat             20480  4
2025-07-02T08:03:01.0188291Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-07-02T08:03:01.0188743Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-07-02T08:03:01.0189153Z br_netfilter           36864  0
2025-07-02T08:03:01.0189438Z bridge                323584  1 br_netfilter
2025-07-02T08:03:01.0189755Z stp                    16384  1 bridge
2025-07-02T08:03:01.0190050Z llc                    16384  2 bridge,stp
2025-07-02T08:03:01.0190363Z overlay               167936  0
2025-07-02T08:03:01.0190622Z tls                   139264  0
2025-07-02T08:03:01.0190896Z nls_ascii              16384  1
2025-07-02T08:03:01.0191159Z nls_cp437              20480  1
2025-07-02T08:03:01.0191411Z vfat                   24576  1
2025-07-02T08:03:01.0191675Z fat                    86016  1 vfat
2025-07-02T08:03:01.0191951Z sunrpc                700416  1
2025-07-02T08:03:01.0192221Z i8042                  45056  0
2025-07-02T08:03:01.0192471Z ena                   180224  0
2025-07-02T08:03:01.0192735Z serio                  28672  3 i8042
2025-07-02T08:03:01.0193016Z button                 24576  0
2025-07-02T08:03:01.0193289Z ghash_clmulni_intel    16384  0
2025-07-02T08:03:01.0193562Z sch_fq_codel           20480  17
2025-07-02T08:03:01.0193836Z fuse                  184320  1
2025-07-02T08:03:01.0194098Z dm_mod                188416  0
2025-07-02T08:03:01.0194355Z loop                   36864  0
2025-07-02T08:03:01.0194616Z configfs               57344  1
2025-07-02T08:03:01.0194876Z dmi_sysfs              20480  0
2025-07-02T08:03:01.0195144Z crc32_pclmul           16384  0
2025-07-02T08:03:01.0195410Z crc32c_intel           24576  0
2025-07-02T08:03:01.0195677Z efivarfs               24576  1
2025-07-02T08:03:01.0195935Z + modinfo nvidia
2025-07-02T08:03:01.0201871Z filename:       /lib/modules/6.1.141-155.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-07-02T08:03:01.0202552Z import_ns:      DMA_BUF
2025-07-02T08:03:01.0202886Z alias:          char-major-195-*
2025-07-02T08:03:01.0203259Z version:        570.133.07
2025-07-02T08:03:01.0203550Z supported:      external
2025-07-02T08:03:01.0203902Z license:        Dual MIT/GPL
2025-07-02T08:03:01.0204298Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-07-02T08:03:01.0204775Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-07-02T08:03:01.0205182Z srcversion:     49515739FD8F721A3F2F714
2025-07-02T08:03:01.0205513Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-07-02T08:03:01.0205880Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-07-02T08:03:01.0206235Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-07-02T08:03:01.0206692Z depends:        i2c-core,drm
2025-07-02T08:03:01.0206953Z retpoline:      Y
2025-07-02T08:03:01.0207180Z name:           nvidia
2025-07-02T08:03:01.0207559Z vermagic:       6.1.141-155.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-07-02T08:03:01.0208177Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-07-02T08:03:01.0208806Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-07-02T08:03:01.0209338Z parm:           NVreg_ResmanDebugLevel:int
2025-07-02T08:03:01.0209665Z parm:           NVreg_RmLogonRC:int
2025-07-02T08:03:01.0209978Z parm:           NVreg_ModifyDeviceFiles:int
2025-07-02T08:03:01.0210317Z parm:           NVreg_DeviceFileUID:int
2025-07-02T08:03:01.0210631Z parm:           NVreg_DeviceFileGID:int
2025-07-02T08:03:01.0211228Z parm:           NVreg_DeviceFileMode:int
2025-07-02T08:03:01.0211725Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-07-02T08:03:01.0212253Z parm:           NVreg_UsePageAttributeTable:int
2025-07-02T08:03:01.0212629Z parm:           NVreg_EnablePCIeGen3:int
2025-07-02T08:03:01.0213053Z parm:           NVreg_EnableMSI:int
2025-07-02T08:03:01.0213474Z parm:           NVreg_EnableStreamMemOPs:int
2025-07-02T08:03:01.0213963Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-07-02T08:03:01.0214571Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-07-02T08:03:01.0215215Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-07-02T08:03:01.0215793Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-07-02T08:03:01.0216311Z parm:           NVreg_DynamicPowerManagement:int
2025-07-02T08:03:01.0216769Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-07-02T08:03:01.0217217Z parm:           NVreg_EnableGpuFirmware:int
2025-07-02T08:03:01.0217577Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-07-02T08:03:01.0217980Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-07-02T08:03:01.0218389Z parm:           NVreg_EnableUserNUMAManagement:int
2025-07-02T08:03:01.0218760Z parm:           NVreg_MemoryPoolSize:int
2025-07-02T08:03:01.0219102Z parm:           NVreg_KMallocHeapMaxSize:int
2025-07-02T08:03:01.0219460Z parm:           NVreg_VMallocHeapMaxSize:int
2025-07-02T08:03:01.0219801Z parm:           NVreg_IgnoreMMIOCheck:int
2025-07-02T08:03:01.0220142Z parm:           NVreg_NvLinkDisable:int
2025-07-02T08:03:01.0220522Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-07-02T08:03:01.0220910Z parm:           NVreg_RegisterPCIDriver:int
2025-07-02T08:03:01.0221268Z parm:           NVreg_EnableResizableBar:int
2025-07-02T08:03:01.0221626Z parm:           NVreg_EnableDbgBreakpoint:int
2025-07-02T08:03:01.0222001Z parm:           NVreg_EnableNonblockingOpen:int
2025-07-02T08:03:01.0222365Z parm:           NVreg_RegistryDwords:charp
2025-07-02T08:03:01.0222736Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-07-02T08:03:01.0223093Z parm:           NVreg_RmMsg:charp
2025-07-02T08:03:01.0223411Z parm:           NVreg_GpuBlacklist:charp
2025-07-02T08:03:01.0223767Z parm:           NVreg_TemporaryFilePath:charp
2025-07-02T08:03:01.0224117Z parm:           NVreg_ExcludedGpus:charp
2025-07-02T08:03:01.0224460Z parm:           NVreg_DmaRemapPeerMmio:int
2025-07-02T08:03:01.0224810Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-07-02T08:03:01.0225202Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-07-02T08:03:01.0225576Z parm:           NVreg_ImexChannelCount:int
2025-07-02T08:03:01.0225930Z parm:           NVreg_CreateImexChannel0:int
2025-07-02T08:03:01.0226301Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-07-02T08:03:01.0226674Z parm:           rm_firmware_active:charp
2025-07-02T08:03:01.0226996Z + HAS_NVIDIA_DRIVER=0
2025-07-02T08:03:01.0227242Z ++ command -v nvidia-smi
2025-07-02T08:03:01.0227511Z + '[' -x /usr/bin/nvidia-smi ']'
2025-07-02T08:03:01.0227774Z + set +e
2025-07-02T08:03:01.0228100Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-07-02T08:03:02.8461551Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-07-02T08:03:02.8462065Z + NVIDIA_SMI_STATUS=0
2025-07-02T08:03:02.8462398Z + '[' 0 -ne 0 ']'
2025-07-02T08:03:02.8462692Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-07-02T08:03:02.8463083Z + HAS_NVIDIA_DRIVER=1
2025-07-02T08:03:02.8463579Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-07-02T08:03:02.8493825Z + set -e
2025-07-02T08:03:02.8494154Z + '[' 1 -eq 0 ']'
2025-07-02T08:03:02.8494892Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-07-02T08:03:02.8495645Z + post_install_nvidia_driver_common
2025-07-02T08:03:02.8495977Z + sudo modprobe nvidia
2025-07-02T08:03:02.9908864Z + echo 'After installing NVIDIA driver'
2025-07-02T08:03:02.9909292Z + lspci
2025-07-02T08:03:02.9909595Z After installing NVIDIA driver
2025-07-02T08:03:03.0035380Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-07-02T08:03:03.0036118Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-07-02T08:03:03.0036825Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-07-02T08:03:03.0037406Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-07-02T08:03:03.0038425Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-07-02T08:03:03.0039156Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-07-02T08:03:03.0039682Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-07-02T08:03:03.0040195Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-07-02T08:03:03.0040633Z + lsmod
2025-07-02T08:03:03.0075868Z Module                  Size  Used by
2025-07-02T08:03:03.0076322Z nvidia_uvm           1884160  0
2025-07-02T08:03:03.0076697Z nvidia              11583488  1 nvidia_uvm
2025-07-02T08:03:03.0077097Z drm                   602112  1 nvidia
2025-07-02T08:03:03.0077439Z drm_panel_orientation_quirks    32768  1 drm
2025-07-02T08:03:03.0077761Z backlight              24576  1 drm
2025-07-02T08:03:03.0078118Z i2c_core              110592  2 nvidia,drm
2025-07-02T08:03:03.0078537Z xt_conntrack           16384  1
2025-07-02T08:03:03.0078898Z nft_chain_nat          16384  3
2025-07-02T08:03:03.0079269Z xt_MASQUERADE          20480  1
2025-07-02T08:03:03.0079650Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-07-02T08:03:03.0080006Z nf_conntrack_netlink   57344  0
2025-07-02T08:03:03.0080422Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-07-02T08:03:03.0080893Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-07-02T08:03:03.0081213Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-07-02T08:03:03.0081521Z xfrm_user              57344  1
2025-07-02T08:03:03.0081797Z xfrm_algo              16384  1 xfrm_user
2025-07-02T08:03:03.0082092Z xt_addrtype            16384  2
2025-07-02T08:03:03.0082365Z nft_compat             20480  4
2025-07-02T08:03:03.0082675Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-07-02T08:03:03.0083116Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-07-02T08:03:03.0083515Z br_netfilter           36864  0
2025-07-02T08:03:03.0083850Z bridge                323584  1 br_netfilter
2025-07-02T08:03:03.0084150Z stp                    16384  1 bridge
2025-07-02T08:03:03.0084445Z llc                    16384  2 bridge,stp
2025-07-02T08:03:03.0084734Z overlay               167936  0
2025-07-02T08:03:03.0084993Z tls                   139264  0
2025-07-02T08:03:03.0085256Z nls_ascii              16384  1
2025-07-02T08:03:03.0085514Z nls_cp437              20480  1
2025-07-02T08:03:03.0085768Z vfat                   24576  1
2025-07-02T08:03:03.0086028Z fat                    86016  1 vfat
2025-07-02T08:03:03.0086335Z sunrpc                700416  1
2025-07-02T08:03:03.0086610Z i8042                  45056  0
2025-07-02T08:03:03.0086852Z ena                   180224  0
2025-07-02T08:03:03.0087428Z serio                  28672  3 i8042
2025-07-02T08:03:03.0087709Z button                 24576  0
2025-07-02T08:03:03.0087973Z ghash_clmulni_intel    16384  0
2025-07-02T08:03:03.0088240Z sch_fq_codel           20480  17
2025-07-02T08:03:03.0088506Z fuse                  184320  1
2025-07-02T08:03:03.0088756Z dm_mod                188416  0
2025-07-02T08:03:03.0089011Z loop                   36864  0
2025-07-02T08:03:03.0089258Z configfs               57344  1
2025-07-02T08:03:03.0089518Z dmi_sysfs              20480  0
2025-07-02T08:03:03.0089777Z crc32_pclmul           16384  0
2025-07-02T08:03:03.0090030Z crc32c_intel           24576  0
2025-07-02T08:03:03.0090287Z efivarfs               24576  1
2025-07-02T08:03:03.0090541Z + modinfo nvidia
2025-07-02T08:03:03.0095791Z filename:       /lib/modules/6.1.141-155.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-07-02T08:03:03.0096512Z import_ns:      DMA_BUF
2025-07-02T08:03:03.0096858Z alias:          char-major-195-*
2025-07-02T08:03:03.0097186Z version:        570.133.07
2025-07-02T08:03:03.0097434Z supported:      external
2025-07-02T08:03:03.0097689Z license:        Dual MIT/GPL
2025-07-02T08:03:03.0097978Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-07-02T08:03:03.0098334Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-07-02T08:03:03.0098784Z srcversion:     49515739FD8F721A3F2F714
2025-07-02T08:03:03.0099113Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-07-02T08:03:03.0099468Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-07-02T08:03:03.0099817Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-07-02T08:03:03.0100143Z depends:        i2c-core,drm
2025-07-02T08:03:03.0100398Z retpoline:      Y
2025-07-02T08:03:03.0100620Z name:           nvidia
2025-07-02T08:03:03.0100988Z vermagic:       6.1.141-155.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-07-02T08:03:03.0101534Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-07-02T08:03:03.0102169Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-07-02T08:03:03.0102758Z parm:           NVreg_ResmanDebugLevel:int
2025-07-02T08:03:03.0103133Z parm:           NVreg_RmLogonRC:int
2025-07-02T08:03:03.0103441Z parm:           NVreg_ModifyDeviceFiles:int
2025-07-02T08:03:03.0103778Z parm:           NVreg_DeviceFileUID:int
2025-07-02T08:03:03.0104090Z parm:           NVreg_DeviceFileGID:int
2025-07-02T08:03:03.0104408Z parm:           NVreg_DeviceFileMode:int
2025-07-02T08:03:03.0104782Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-07-02T08:03:03.0105194Z parm:           NVreg_UsePageAttributeTable:int
2025-07-02T08:03:03.0105539Z parm:           NVreg_EnablePCIeGen3:int
2025-07-02T08:03:03.0105856Z parm:           NVreg_EnableMSI:int
2025-07-02T08:03:03.0106177Z parm:           NVreg_EnableStreamMemOPs:int
2025-07-02T08:03:03.0106556Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-07-02T08:03:03.0106981Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-07-02T08:03:03.0107384Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-07-02T08:03:03.0107825Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-07-02T08:03:03.0108253Z parm:           NVreg_DynamicPowerManagement:int
2025-07-02T08:03:03.0108706Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-07-02T08:03:03.0109138Z parm:           NVreg_EnableGpuFirmware:int
2025-07-02T08:03:03.0109494Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-07-02T08:03:03.0109886Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-07-02T08:03:03.0110276Z parm:           NVreg_EnableUserNUMAManagement:int
2025-07-02T08:03:03.0110637Z parm:           NVreg_MemoryPoolSize:int
2025-07-02T08:03:03.0111139Z parm:           NVreg_KMallocHeapMaxSize:int
2025-07-02T08:03:03.0111489Z parm:           NVreg_VMallocHeapMaxSize:int
2025-07-02T08:03:03.0111822Z parm:           NVreg_IgnoreMMIOCheck:int
2025-07-02T08:03:03.0112148Z parm:           NVreg_NvLinkDisable:int
2025-07-02T08:03:03.0112663Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-07-02T08:03:03.0113052Z parm:           NVreg_RegisterPCIDriver:int
2025-07-02T08:03:03.0113397Z parm:           NVreg_EnableResizableBar:int
2025-07-02T08:03:03.0113745Z parm:           NVreg_EnableDbgBreakpoint:int
2025-07-02T08:03:03.0114119Z parm:           NVreg_EnableNonblockingOpen:int
2025-07-02T08:03:03.0114467Z parm:           NVreg_RegistryDwords:charp
2025-07-02T08:03:03.0114827Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-07-02T08:03:03.0115171Z parm:           NVreg_RmMsg:charp
2025-07-02T08:03:03.0115468Z parm:           NVreg_GpuBlacklist:charp
2025-07-02T08:03:03.0115801Z parm:           NVreg_TemporaryFilePath:charp
2025-07-02T08:03:03.0116178Z parm:           NVreg_ExcludedGpus:charp
2025-07-02T08:03:03.0116523Z parm:           NVreg_DmaRemapPeerMmio:int
2025-07-02T08:03:03.0116861Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-07-02T08:03:03.0117244Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-07-02T08:03:03.0117608Z parm:           NVreg_ImexChannelCount:int
2025-07-02T08:03:03.0117949Z parm:           NVreg_CreateImexChannel0:int
2025-07-02T08:03:03.0118308Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-07-02T08:03:03.0118669Z parm:           rm_firmware_active:charp
2025-07-02T08:03:03.0119091Z + set +e
2025-07-02T08:03:03.0119285Z + nvidia-smi
2025-07-02T08:03:04.4024385Z Wed Jul  2 08:03:04 2025
2025-07-02T08:03:04.4024800Z +-----------------------------------------------------------------------------------------+
2025-07-02T08:03:04.4025332Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-07-02T08:03:04.4025853Z |-----------------------------------------+------------------------+----------------------+
2025-07-02T08:03:04.4026372Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-07-02T08:03:04.4026971Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-07-02T08:03:04.4027440Z |                                         |                        |               MIG M. |
2025-07-02T08:03:04.4027781Z |=========================================+========================+======================|
2025-07-02T08:03:04.4090968Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-07-02T08:03:04.4091432Z |  0%   34C    P0             66W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-07-02T08:03:04.4091830Z |                                         |                        |                  N/A |
2025-07-02T08:03:04.4092238Z +-----------------------------------------+------------------------+----------------------+
2025-07-02T08:03:04.4092847Z 
2025-07-02T08:03:04.4093290Z +-----------------------------------------------------------------------------------------+
2025-07-02T08:03:04.4093755Z | Processes:                                                                              |
2025-07-02T08:03:04.4094226Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-07-02T08:03:04.4094653Z |        ID   ID                                                               Usage      |
2025-07-02T08:03:04.4095072Z |=========================================================================================|
2025-07-02T08:03:04.4096109Z |  No running processes found                                                             |
2025-07-02T08:03:04.4096791Z +-----------------------------------------------------------------------------------------+
2025-07-02T08:03:04.8179844Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-07-02T08:03:06.2134848Z NVIDIA A10G
2025-07-02T08:03:06.4787485Z + NVIDIA_SMI_STATUS=0
2025-07-02T08:03:06.4787744Z + '[' 0 -eq 0 ']'
2025-07-02T08:03:06.4788122Z + echo 'INFO: Ignoring allowed status 0'
2025-07-02T08:03:06.4788684Z + set -e
2025-07-02T08:03:06.4788907Z INFO: Ignoring allowed status 0
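As the comments in the setup script explain, plain nvidia-smi can exit 0 even when the driver has crashed (fields just render as ERR!), so the check above queries a specific field that disappears in that state. The same probe as a standalone guard (illustrative, not part of the log):

    # Illustrative: a wedged driver may still print the banner with exit code 0,
    # but a field query like gpu_name fails, which is what we want to detect.
    if ! nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 >/dev/null; then
        echo "GPU driver is not healthy; failing early" >&2
        exit 1
    fi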
2025-07-02T08:03:07.9693964Z ================================================================================
2025-07-02T08:03:07.9694514Z  Package                        Arch    Version   Repository                Size
2025-07-02T08:03:07.9695176Z ================================================================================
2025-07-02T08:03:07.9695657Z Downgrading:
2025-07-02T08:03:07.9696212Z  libnvidia-container-tools      x86_64  1.16.2-1  nvidia-container-toolkit  39 k
2025-07-02T08:03:07.9697142Z  libnvidia-container1           x86_64  1.16.2-1  nvidia-container-toolkit  1.0 M
2025-07-02T08:03:07.9698142Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit  1.2 M
2025-07-02T08:03:07.9699133Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit  5.6 M
2025-07-02T08:03:07.9699721Z 
2025-07-02T08:03:07.9699882Z Transaction Summary
2025-07-02T08:03:07.9700270Z ================================================================================
2025-07-02T08:03:07.9700858Z Downgrade  4 Packages
2025-07-02T08:03:07.9701160Z 
2025-07-02T08:03:07.9701339Z Total download size: 7.8 M
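Because newer 1.17.8 builds of these four packages ship preinstalled on the AMI, yum resolves the version-pinned install as the downgrade transaction shown above. A roughly equivalent explicit invocation on a dnf-based system such as amzn2023 would be the following sketch (not taken from this workflow):

    # Explicitly downgrade the container toolkit stack to the pinned release.
    sudo dnf downgrade -y \
        nvidia-container-toolkit-1.16.2-1 \
        nvidia-container-toolkit-base-1.16.2-1 \
        libnvidia-container-tools-1.16.2-1 \
        libnvidia-container1-1.16.2-1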
2025-07-02T08:03:07.9701851Z Downloading Packages:
2025-07-02T08:03:07.9856651Z (1/4): libnvidia-container-tools-1.16.2-1.x86_6 2.7 MB/s |  39 kB     00:00
2025-07-02T08:03:08.0024207Z (2/4): libnvidia-container1-1.16.2-1.x86_64.rpm  32 MB/s | 1.0 MB     00:00
2025-07-02T08:03:08.0164358Z (3/4): nvidia-container-toolkit-1.16.2-1.x86_64  28 MB/s | 1.2 MB     00:00
2025-07-02T08:03:08.0470884Z (4/4): nvidia-container-toolkit-base-1.16.2-1.x  92 MB/s | 5.6 MB     00:00
2025-07-02T08:03:08.0480911Z --------------------------------------------------------------------------------
2025-07-02T08:03:08.0483620Z Total                                           101 MB/s | 7.8 MB     00:00
2025-07-02T08:03:08.0486129Z Running transaction check
2025-07-02T08:03:08.0598548Z Transaction check succeeded.
2025-07-02T08:03:08.0598843Z Running transaction test
2025-07-02T08:03:08.1007973Z Transaction test succeeded.
2025-07-02T08:03:08.1010702Z Running transaction
2025-07-02T08:03:08.6671072Z   Preparing        :                                                        1/1
2025-07-02T08:03:08.7569851Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64         1/8
2025-07-02T08:03:08.7606161Z   Downgrading      : libnvidia-container1-1.16.2-1.x86_64                  2/8
2025-07-02T08:03:08.7878110Z   Running scriptlet: libnvidia-container1-1.16.2-1.x86_64                  2/8
2025-07-02T08:03:08.8950707Z   Downgrading      : libnvidia-container-tools-1.16.2-1.x86_64             3/8
2025-07-02T08:03:08.8989836Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64              4/8
2025-07-02T08:03:08.9219410Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              4/8
2025-07-02T08:03:08.9220042Z   Cleanup          : nvidia-container-toolkit-1.17.8-1.x86_64              5/8
2025-07-02T08:03:08.9334161Z   Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64              5/8
2025-07-02T08:03:08.9367505Z   Cleanup          : libnvidia-container-tools-1.17.8-1.x86_64             6/8
2025-07-02T08:03:08.9368674Z   Cleanup          : libnvidia-container1-1.17.8-1.x86_64                  7/8
2025-07-02T08:03:08.9600946Z   Running scriptlet: libnvidia-container1-1.17.8-1.x86_64                  7/8
2025-07-02T08:03:08.9624815Z   Cleanup          : nvidia-container-toolkit-base-1.17.8-1.x86_64         8/8
2025-07-02T08:03:09.0220652Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64              8/8
2025-07-02T08:03:09.2406319Z   Running scriptlet: nvidia-container-toolkit-base-1.17.8-1.x86_64         8/8
2025-07-02T08:03:09.2407634Z   Verifying        : libnvidia-container-tools-1.16.2-1.x86_64             1/8
2025-07-02T08:03:09.2408385Z   Verifying        : libnvidia-container-tools-1.17.8-1.x86_64             2/8
2025-07-02T08:03:09.2409299Z   Verifying        : libnvidia-container1-1.16.2-1.x86_64                  3/8
2025-07-02T08:03:09.2409862Z   Verifying        : libnvidia-container1-1.17.8-1.x86_64                  4/8
2025-07-02T08:03:09.2410434Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64              5/8
2025-07-02T08:03:09.2411163Z   Verifying        : nvidia-container-toolkit-1.17.8-1.x86_64              6/8
2025-07-02T08:03:09.2411758Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64         7/8
2025-07-02T08:03:09.4151169Z   Verifying        : nvidia-container-toolkit-base-1.17.8-1.x86_64         8/8
2025-07-02T08:03:09.4151567Z 
2025-07-02T08:03:09.4151659Z Downgraded:
2025-07-02T08:03:09.4152024Z   libnvidia-container-tools-1.16.2-1.x86_64
2025-07-02T08:03:09.4152615Z   libnvidia-container1-1.16.2-1.x86_64
2025-07-02T08:03:09.4153208Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-07-02T08:03:09.4153821Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-07-02T08:03:09.4154195Z 
2025-07-02T08:03:09.4154284Z Complete!
2025-07-02T08:03:09.4617067Z + sudo systemctl restart docker
2025-07-02T08:03:13.4089505Z Wed Jul  2 08:03:13 2025
2025-07-02T08:03:13.4090099Z +-----------------------------------------------------------------------------------------+
2025-07-02T08:03:13.4090855Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-07-02T08:03:13.4091505Z |-----------------------------------------+------------------------+----------------------+
2025-07-02T08:03:13.4092033Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-07-02T08:03:13.4092604Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-07-02T08:03:13.4093066Z |                                         |                        |               MIG M. |
2025-07-02T08:03:13.4093408Z |=========================================+========================+======================|
2025-07-02T08:03:13.4175621Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-07-02T08:03:13.4176242Z |  0%   35C    P0             66W / 300W  |      0MiB / 23028MiB   |      4%      Default |
2025-07-02T08:03:13.4176705Z |                                         |                        |                  N/A |
2025-07-02T08:03:13.4177253Z +-----------------------------------------+------------------------+----------------------+
2025-07-02T08:03:13.4178132Z 
2025-07-02T08:03:13.4178550Z +-----------------------------------------------------------------------------------------+
2025-07-02T08:03:13.4179018Z | Processes:                                                                              |
2025-07-02T08:03:13.4179527Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-07-02T08:03:13.4179962Z |        ID   ID                                                               Usage      |
2025-07-02T08:03:13.4180364Z |=========================================================================================|
2025-07-02T08:03:13.4180965Z |  No running processes found                                                             |
2025-07-02T08:03:13.4181638Z +-----------------------------------------------------------------------------------------+
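With the toolkit downgraded and docker restarted, nvidia-smi on the host confirms the driver is still healthy (note Persistence-M is now On). A typical in-container smoke test at this point, assuming the DOCKER_IMAGE and GPU_FLAG values this job uses (shown in the env dumps below), would be a sketch like:

    # Verify the container runtime can see the GPU after the toolkit downgrade.
    sudo docker run --rm --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all \
        nvidia/cuda:12.1.1-devel-ubuntu22.04 nvidia-smi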
2025-07-02T08:03:13.5798309Z Unable to find image 'public.ecr.aws/docker/library/python:3.13' locally
2025-07-02T08:03:13.7554518Z 3.13: Pulling from docker/library/python
2025-07-02T08:03:13.8484904Z c19952135643: Pulling fs layer
2025-07-02T08:03:13.8485329Z 7bbf972c6c2f: Pulling fs layer
2025-07-02T08:03:13.8485705Z 900e2c02f17f: Pulling fs layer
2025-07-02T08:03:13.8486278Z abe9c1abe6f3: Pulling fs layer
2025-07-02T08:03:13.8486554Z 562e9f67c041: Pulling fs layer
2025-07-02T08:03:13.8486828Z 8ae8ebad5c0e: Pulling fs layer
2025-07-02T08:03:13.8487129Z 5b1a73f6734a: Pulling fs layer
2025-07-02T08:03:13.8487448Z abe9c1abe6f3: Waiting
2025-07-02T08:03:13.8487841Z 562e9f67c041: Waiting
2025-07-02T08:03:13.8488071Z 5b1a73f6734a: Waiting
2025-07-02T08:03:13.8488291Z 8ae8ebad5c0e: Waiting
2025-07-02T08:03:13.9531522Z 7bbf972c6c2f: Verifying Checksum
2025-07-02T08:03:13.9531849Z 7bbf972c6c2f: Download complete
2025-07-02T08:03:14.0315056Z c19952135643: Verifying Checksum
2025-07-02T08:03:14.0316163Z c19952135643: Download complete
2025-07-02T08:03:14.0762426Z 900e2c02f17f: Verifying Checksum
2025-07-02T08:03:14.0762900Z 900e2c02f17f: Download complete
2025-07-02T08:03:14.1092483Z 562e9f67c041: Verifying Checksum
2025-07-02T08:03:14.1092855Z 562e9f67c041: Download complete
2025-07-02T08:03:14.1795414Z 5b1a73f6734a: Verifying Checksum
2025-07-02T08:03:14.1795881Z 5b1a73f6734a: Download complete
2025-07-02T08:03:14.2072733Z 8ae8ebad5c0e: Verifying Checksum
2025-07-02T08:03:14.2073157Z 8ae8ebad5c0e: Download complete
2025-07-02T08:03:14.7493675Z abe9c1abe6f3: Verifying Checksum
2025-07-02T08:03:14.7494155Z abe9c1abe6f3: Download complete
2025-07-02T08:03:15.9700662Z c19952135643: Pull complete
2025-07-02T08:03:16.5405644Z 7bbf972c6c2f: Pull complete
2025-07-02T08:03:18.8453401Z 900e2c02f17f: Pull complete
2025-07-02T08:03:22.5907902Z ##[error]The operation was canceled.
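The python:3.13 helper-image pull above stopped because the job itself was canceled mid-download, not because the registry failed. For genuinely flaky pulls, a bounded retry loop is the usual mitigation; a sketch, not part of this workflow:

    # Retry the pull a few times with a growing back-off before giving up.
    for attempt in 1 2 3; do
        docker pull public.ecr.aws/docker/library/python:3.13 && break
        echo "pull attempt ${attempt} failed; retrying" >&2
        sleep $((attempt * 10))
    done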
2025-07-02T08:03:22.6049221Z ##[group]Run pmeier/pytest-results-action@a2c1430e2bddadbad9f49a6f9b879f062c6b19b1
2025-07-02T08:03:22.6049736Z with:
2025-07-02T08:03:22.6050035Z   path: /home/ec2-user/actions-runner/_work/_temp/test-results
2025-07-02T08:03:22.6050428Z   fail-on-empty: false
2025-07-02T08:03:22.6050651Z env:
2025-07-02T08:03:22.6050919Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:03:22.6051315Z   REPOSITORY: pytorch/rl
2025-07-02T08:03:22.6051646Z   PR_NUMBER: 3030
2025-07-02T08:03:22.6054179Z   SCRIPT: if [[ "refs/pull/3030/merge" =~ release/* ]]; then
      export RELEASE=1
      export TORCH_VERSION=stable
    else
      export RELEASE=0
      export TORCH_VERSION=nightly
    fi
    # Set env vars from matrix
    export PYTHON_VERSION=3.9
    # Commenting these out for now because the GPU tests are not working inside docker
    export CUDA_ARCH_VERSION=12.8
    export CU_VERSION="cu${CUDA_ARCH_VERSION:0:2}${CUDA_ARCH_VERSION:3:1}"
    # Remove the following line when the GPU tests are working inside docker, and uncomment the above lines
    #export CU_VERSION="cpu"
    export TD_GET_DEFAULTS_TO_NONE=1
    bash .github/unittest/linux_libs/scripts_habitat/run_all.sh
2025-07-02T08:03:22.6056985Z   RUNNER_ARTIFACT_DIR: /home/ec2-user/actions-runner/_work/_temp/artifacts
2025-07-02T08:03:22.6057596Z   RUNNER_TEST_RESULTS_DIR: /home/ec2-user/actions-runner/_work/_temp/test-results
2025-07-02T08:03:22.6058162Z   RUNNER_DOCS_DIR: /home/ec2-user/actions-runner/_work/_temp/docs
2025-07-02T08:03:22.6058627Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-07-02T08:03:22.6058987Z ##[endgroup]
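pmeier/pytest-results-action summarizes any JUnit XML files it finds under "path", here RUNNER_TEST_RESULTS_DIR. For it to have something to report, the test step must write results there first; with pytest that is typically done as in this sketch (not this job's exact invocation):

    # Write JUnit XML where the results action will look for it.
    pytest test/ --junit-xml="${RUNNER_TEST_RESULTS_DIR}/test-results.xml"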
2025-07-02T08:03:22.6908525Z ##[group]Run # Only do these steps if we actually want to upload an artifact
2025-07-02T08:03:22.6909161Z # Only do these steps if we actually want to upload an artifact
2025-07-02T08:03:22.6909638Z if [[ -n "${UPLOAD_ARTIFACT_NAME}" ]]; then
2025-07-02T08:03:22.6910205Z   # If the default execution path is followed then we should get a wheel in the dist/ folder
2025-07-02T08:03:22.6911107Z   # attempt to just grab whatever is in there and scoop it all up
2025-07-02T08:03:22.6911680Z   if find "dist/" -name "*.whl" >/dev/null 2>/dev/null; then
2025-07-02T08:03:22.6912133Z     mv -v dist/*.whl "${RUNNER_ARTIFACT_DIR}/"
2025-07-02T08:03:22.6912474Z   fi
2025-07-02T08:03:22.6912749Z   if [[ -d "artifacts-to-be-uploaded" ]]; then
2025-07-02T08:03:22.6913206Z     mv -v artifacts-to-be-uploaded/* "${RUNNER_ARTIFACT_DIR}/"
2025-07-02T08:03:22.6913794Z   fi
2025-07-02T08:03:22.6914035Z fi
2025-07-02T08:03:22.6914235Z 
2025-07-02T08:03:22.6914451Z upload_docs=0
2025-07-02T08:03:22.6914854Z # Check if there are files in the documentation folder to upload, note that
2025-07-02T08:03:22.6915339Z # empty folders do not count
2025-07-02T08:03:22.6915803Z if find "${RUNNER_DOCS_DIR}" -mindepth 1 -maxdepth 1 -type f | read -r; then
2025-07-02T08:03:22.6916428Z   # TODO: Add a check here to test if on ec2 because if we're not on ec2 then this
2025-07-02T08:03:22.6916944Z   # upload will probably not work correctly
2025-07-02T08:03:22.6917292Z   upload_docs=1
2025-07-02T08:03:22.6917539Z fi
2025-07-02T08:03:22.6917854Z echo "upload-docs=${upload_docs}" >> "${GITHUB_OUTPUT}"
2025-07-02T08:03:22.6932951Z shell: /usr/bin/bash -e {0}
2025-07-02T08:03:22.6933218Z env:
2025-07-02T08:03:22.6933483Z   DOCKER_IMAGE: nvidia/cuda:12.1.1-devel-ubuntu22.04
2025-07-02T08:03:22.6933850Z   REPOSITORY: pytorch/rl
2025-07-02T08:03:22.6934106Z   PR_NUMBER: 3030
2025-07-02T08:03:22.6936740Z   SCRIPT: if [[ "refs/pull/3030/merge" =~ release/* ]]; then
      export RELEASE=1
      export TORCH_VERSION=stable
    else
      export RELEASE=0
      export TORCH_VERSION=nightly
    fi
    # Set env vars from matrix
    export PYTHON_VERSION=3.9
    # Commenting these out for now because the GPU tests are not working inside docker
    export CUDA_ARCH_VERSION=12.8
    export CU_VERSION="cu${CUDA_ARCH_VERSION:0:2}${CUDA_ARCH_VERSION:3:1}"
    # Remove the following line when the GPU tests are working inside docker, and uncomment the above lines
    #export CU_VERSION="cpu"
    export TD_GET_DEFAULTS_TO_NONE=1
    bash .github/unittest/linux_libs/scripts_habitat/run_all.sh
2025-07-02T08:03:22.6939381Z   RUNNER_ARTIFACT_DIR: /home/ec2-user/actions-runner/_work/_temp/artifacts
2025-07-02T08:03:22.6939983Z   RUNNER_TEST_RESULTS_DIR: /home/ec2-user/actions-runner/_work/_temp/test-results
2025-07-02T08:03:22.6940557Z   RUNNER_DOCS_DIR: /home/ec2-user/actions-runner/_work/_temp/docs
2025-07-02T08:03:22.6941026Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-07-02T08:03:22.6941399Z   UPLOAD_ARTIFACT_NAME: 
2025-07-02T08:03:22.6941643Z ##[endgroup]
2025-07-02T08:03:22.6981711Z ##[error]An error occurred trying to start process '/usr/bin/bash' with working directory '/home/ec2-user/actions-runner/_work/rl/rl/pytorch/rl'. No such file or directory
2025-07-02T08:03:22.7202987Z Post job cleanup.
2025-07-02T08:03:22.8227007Z [command]/usr/bin/git version
2025-07-02T08:03:22.8278002Z git version 2.47.1
2025-07-02T08:03:22.8329446Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/1609e8f4-8d22-47b2-b702-ce66b26e3422' before making global git config changes
2025-07-02T08:03:22.8330427Z Adding repository directory to the temporary git global config as a safe directory
2025-07-02T08:03:22.8335512Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/rl/rl/test-infra
2025-07-02T08:03:22.8379560Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-07-02T08:03:22.8418319Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-07-02T08:03:22.8805227Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-07-02T08:03:22.8832490Z http.https://github.com/.extraheader
2025-07-02T08:03:22.8845520Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2025-07-02T08:03:22.8881944Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-07-02T08:03:22.9352004Z A job completed hook has been configured by the self-hosted runner administrator
2025-07-02T08:03:22.9408775Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh'
2025-07-02T08:03:22.9418763Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-07-02T08:03:22.9419154Z ##[endgroup]
2025-07-02T08:03:22.9538319Z [!ALERT!] Swap in detected! [!ALERT!]
2025-07-02T08:03:34.0327599Z [!ALERT!] Swap out detected [!ALERT!]
2025-07-02T08:03:51.7610471Z Cleaning up orphan processes
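One caveat in the artifact-upload script above: find "dist/" -name "*.whl" exits 0 whenever dist/ exists, even if no wheel matched, so the guard effectively only tests for the directory. A stricter variant (a sketch, not the workflow's code) tests find's output instead of its exit status:

    # Only move wheels if at least one actually exists in dist/.
    if [ -n "$(find dist/ -maxdepth 1 -name '*.whl' 2>/dev/null)" ]; then
        mv -v dist/*.whl "${RUNNER_ARTIFACT_DIR}/"
    fi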